Processing .docx and .xlsx files with Python. MS Office documents are probably one of the most inconvenient and poorly formalized data sources. It’s much better to keep all the data in specialized databases or at least in wiki. But in real life, MS Office documents, especially Excel and Word, are in active use in nearly every organization. Simply because it is a flexible and easy tool that anyone can use. That’s why it’s important to know an automated ways for processing such files.
You can easily edit .docx files without any libraries. Technically it’s just a zip archive. So, you can unzip it, make a replacement in the document.xml file and make a zip it again. It’s much better than dealing with old binary .doc files. But there are even more elegant ways.
Let’s says, we need to read data from .xlsx document and generate .docx files based on some existing template. To work with .xlsx files I will use openpyxl python library.
openpyxl library for processing .xlsx files
I installed the library and opened input/table.xlsx file:
#sudo pip install openpyxl
#sudo pip2 install openpyxl
import openpyxl
wb = openpyxl.load_workbook(filename = 'input/table.xlsx')
Now we can choose with the worksheet and get value of some cell:
ws = wb.worksheets[1]
print(ws["A1"].value)
Output:
Username
Ok. We can read one cell. What about the head of the table? If we have 9 columns in the table we can do it like this:
for col in range(1, 10):
print(ws.cell(column=col, row=1).value)
Output:
Username
Surname
Attribute1
Attribute2
Attribute3
Attribute4
Attribute5
Attribute6
Attribute7
If we want to read the entire table (without head), we can do this:
first_cell_value = "1"
row = 2
table = dict()
while first_cell_value != None:
if not ws.row_dimensions[row].hidden:
table[row] = dict()
for col in range(1, 10):
table[row][col] = ws.cell(column=col, row=row).value
row += 1
first_cell_value = ws.cell(column=1, row=row).value
Note that I skiped the hidden rows.
From such a table structure we can get everything we need for analysis .
python-docx library for processing .docx files
Let’s figure out how to work with .docx files. I installed python-docx library and opened .docx template file:
#sudo pip install python-docx
#sudo pip2 install python-docx
import docx
document = docx.Document(docx = 'input/template.docx')
Here is how we can iterate through paragraphs in doc files, find a key word “USERNAME” and replace it with the required value “John”:
import re
for paragraph in document.paragraphs:
if "USERNAME" in paragraph.text:
paragraph.text = re.sub("USERNAME", "John", paragraph.text)
What if we need to read or edit the table in .docx file? You can read the table like this:
for table in document.tables:
for row in table.rows:
for cell in row.cells:
for paragraph in cell.paragraphs:
print(paragraph.text)
To insert rows into a table, we act like this:
rows_to_add = dict()
rows_to_add[1] = ['1','2','3','4']
for table in document.tables:
for row in rows_to_add:
cells = table.add_row().cells
for n in range(0,len(rows_to_add[row]), 1):
cells[n].text = rows_to_add[row][n]
And finally, save the modified template to a new file:
document.save('output/john_smith.docx')
Hi! My name is Alexander and I am a Vulnerability Management specialist. You can read more about me here. Currently, the best way to follow me is my Telegram channel @avleonovcom. I update it more often than this site. If you haven’t used Telegram yet, give it a try. It’s great. You can discuss my posts or ask questions at @avleonovchat.
А всех русскоязычных я приглашаю в ещё один телеграмм канал @avleonovrus, первым делом теперь пишу туда.
Pingback: Asset Inventory for Network Perimeter: from Declarations to Active Scanning | Alexander V. Leonov
Hi Leonov,
Great write up..
I have a question for you, Please suggest……..
I have a report template in Libreoffice writer (It has text and tables). I want to populate the Libreoffice calc data to this template.