Processing .docx and .xlsx files with Python

MS Office documents are probably one of the most inconvenient and poorly formalized data sources. It’s much better to keep all the data in specialized databases or at least in wiki. But in real life, MS Office documents, especially Excel and Word, are in active use in nearly every organization. Simply because it is a flexible and easy tool that anyone can use. That’s why it’s important to know an automated ways for processing such files.

Processing MS office files with python

You can easily edit .docx files without any libraries. Technically it’s just a zip archive. So, you can unzip it, make a replacement in the document.xml file and make a zip it again. It’s much better than dealing with old binary .doc files. But there are even more elegant ways.

Let’s says, we need to read data from .xlsx document and generate .docx files based on some existing template. To work with .xlsx files I will use openpyxl python library.

openpyxl library for processing .xlsx files

I installed the library and opened input/table.xlsx file:

#sudo pip install openpyxl
#sudo pip2 install openpyxl

import openpyxl

wb = openpyxl.load_workbook(filename = 'input/table.xlsx')

Now we can choose with the worksheet and get value of some cell:

ws = wb.worksheets[1]

print(ws["A1"].value)

Output:

Username

Ok. We can read one cell. What about the head of the table? If we have 9 columns in the table we can do it like this:

for col in range(1, 10):
    print(ws.cell(column=col, row=1).value)

Output:

Username
Surname
Attribute1
Attribute2
Attribute3
Attribute4
Attribute5
Attribute6
Attribute7

If we want to read the entire table (without head), we can do this:

first_cell_value = "1"
row = 2
table = dict()
while first_cell_value != None:
    if not ws.row_dimensions[row].hidden:
        table[row] = dict()
        for col in range(1, 10):
            table[row][col] = ws.cell(column=col, row=row).value
    row += 1
    first_cell_value = ws.cell(column=1, row=row).value

Note that I skiped the hidden rows.

From such a table structure we can get everything we need for analysis .

python-docx library for processing .docx files

Let’s figure out how to work with .docx files. I installed python-docx library and opened .docx template file:

#sudo pip install python-docx
#sudo pip2 install python-docx

import docx

document = docx.Document(docx = 'input/template.docx')

Here is how we can iterate through paragraphs in doc files, find a key word “USERNAME” and replace it with the required value “John”:

import re

for paragraph in document.paragraphs:
    if "USERNAME" in paragraph.text:
        paragraph.text = re.sub("USERNAME", "John", paragraph.text)

What if we need to read or edit the table in .docx file? You can read the table like this:

for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text)

To insert rows into a table, we act like this:

rows_to_add = dict()
rows_to_add[1] = ['1','2','3','4']

for table in document.tables:
    for row in rows_to_add:
        cells = table.add_row().cells
        for n in range(0,len(rows_to_add[row]), 1):
            cells[n].text = rows_to_add[row][n]

And finally, save the modified template to a new file:

document.save('output/john_smith.docx')

Alexander Leonov

Hi! My name is Alexander and I am a Vulnerability Management specialist. You can read more about me here. Currently, the best way to follow me is my Telegram channel @avleonovcom. I update it more often than this site. If you haven’t used Telegram yet, give it a try. It’s great. You can discuss my posts or ask questions at @avleonovchat.

А всех русскоязычных я приглашаю в ещё один телеграмм канал @avleonovrus, первым делом теперь пишу туда.

2 thoughts on “Processing .docx and .xlsx files with Python”

Pingback: Asset Inventory for Network Perimeter: from Declarations to Active Scanning | Alexander V. Leonov

Rudra April 10, 2020 at 12:23 pm

Hi Leonov,

Great write up..
I have a question for you, Please suggest……..
I have a report template in Libreoffice writer (It has text and tables). I want to populate the Libreoffice calc data to this template.

Reply ↓

Alexander V. Leonov

Vulnerability Management and more

Processing .docx and .xlsx files with Python

openpyxl library for processing .xlsx files

python-docx library for processing .docx files

2 thoughts on “Processing .docx and .xlsx files with Python”

Leave a Reply Cancel reply