Processing .docx and .xlsx files with Python

MS Office documents are probably one of the most inconvenient and poorly formalized data sources. It’s much better to keep all the data in specialized databases or at least in wiki. But in real life, MS Office documents are in active use in nearly every organization. Simply because it is a flexible and easy tool that anyone can use. That’s why it’s important to know an automated ways for processing such files.

Processing MS office files with python

You can easily edit .docx files without any libraries. Technically it’s just a zip archive. So, you can unzip it, make a replacement in the document.xml file and make a zip it again. It’s much better than dealing with old binary .doc files. But there are even more elegant ways.

Let’s says, we need to read data from .xlsx document and generate .docx files based on some existing template. To work with .xlsx files I will use openpyxl python library.

openpyxl library for processing .xlsx files

I installed the library and opened input/table.xlsx file:

#sudo pip install openpyxl
#sudo pip2 install openpyxl

import openpyxl

wb = openpyxl.load_workbook(filename = 'input/table.xlsx')

Now we can choose with the worksheet and get value of some cell:

ws = wb.worksheets[1]

print(ws["A1"].value)

Output:

Username

Ok. We can read one cell. What about the head of the table? If we have 9 columns in the table we can do it like this:

for col in range(1, 10):
    print(ws.cell(column=col, row=1).value)

Output:

Username
Surname
Attribute1
Attribute2
Attribute3
Attribute4
Attribute5
Attribute6
Attribute7

If we want to read the entire table (without head), we can do this:

first_cell_value = "1"
row = 2
table = dict()
while first_cell_value != None:
    if not ws.row_dimensions[row].hidden:
        table[row] = dict()
        for col in range(1, 10):
            table[row][col] = ws.cell(column=col, row=row).value
    row += 1
    first_cell_value = ws.cell(column=1, row=row).value

Note that I skiped the hidden rows.

From such a table structure we can get everything we need for analysis .

python-docx library for processing .docx files

Let’s figure out how to work with .docx files. I installed python-docx library and opened .docx template file:

#sudo pip install python-docx
#sudo pip2 install python-docx

import docx

document = docx.Document(docx = 'input/template.docx')

Here is how we can iterate through paragraphs in doc files, find a key word “USERNAME” and replace it with the required value “John”:

import re

for paragraph in document.paragraphs:
    if "USERNAME" in paragraph.text:
        paragraph.text = re.sub("USERNAME", "John", paragraph.text)

What if we need to read or edit the table in .docx file? You can read the table like this:

for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text)

To insert rows into a table, we act like this:

rows_to_add = dict()
rows_to_add[1] = ['1','2','3','4']

for table in document.tables:
    for row in rows_to_add:
        cells = table.add_row().cells
        for n in range(0,len(rows_to_add[row]), 1):
            cells[n].text = rows_to_add[row][n]

And finally, save the modified template to a new file:

document.save('output/john_smith.docx')

Leave a Reply

Your email address will not be published. Required fields are marked *