
Extract text from a Microsoft Word Document

By Jay Goodison

8th February 2021

Search Word Documents with Microsoft Power Automate

Several customers have recently asked how to search text within Microsoft Word documents, and this post outlines how this can be easily achieved.

I’m going to break the post down into a few sections, as different techniques can be used to locate text depending on your requirements. However, the first step is always to obtain the text from within the document.

Obtain text content from a Word document

For this example, I will check whether new documents added to a SharePoint list contain email addresses and then update the record’s metadata accordingly.

NOTE: The following steps can be performed on DOC, DOCX, DOTX and RTF files.

1. Create a new ‘Automated cloud flow’ in Power Automate

1.a. Flow name: Provide a name for your flow

1.b. Trigger: Select the ‘When a file is created in a folder’ SharePoint trigger action

1.c. Click ‘Create’

2. Configure the ‘When a file is created in a folder’ SharePoint trigger action as required

NOTE: Before adding the Encodian action, you may wish to add some logic to your flow (or a trigger condition) so that only certain types of files are handled, for example only the DOC, DOCX, DOTX and RTF files supported by these steps.

3. Add the Encodian ‘Convert Word’ action

3.a. Output Format: Select ‘TXT’

3.b. Filename: Select the ‘x-ms-file-name-encoded’ property provided by the ‘When a file is created in a folder’ SharePoint trigger action

3.c. File Content: Select the ‘File Content’ property provided by the ‘When a file is created in a folder’ SharePoint trigger action

Our flow is now configured to obtain the contents of the Microsoft Word document in TXT (text) format. The Encodian action returns a TXT file, and we can access its contents by decoding the base64-encoded file with the base64ToString() expression:

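4. Add a ‘Compose’ action named ‘Compose - TXT to Text’ (the expression used later in this post references it by that name), and set its input to a base64ToString() expression that decodes the file content returned by the ‘Convert Word’ action.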
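If it helps to see what the decoding in step 4 does, here is a minimal sketch in Python rather than Power Automate. The function name base64_to_string and the sample content are my own stand-ins, and I am assuming the TXT file is UTF-8 encoded; it illustrates the idea and is not the Encodian or Power Automate implementation.

```python
import base64


def base64_to_string(encoded: str) -> str:
    """Decode a base64 string to text, assuming the bytes are UTF-8."""
    return base64.b64decode(encoded).decode("utf-8")


# Hypothetical stand-in for the base64 file content returned by the conversion step.
sample_file_content = base64.b64encode(
    "Contact: jane.doe@example.com".encode("utf-8")
).decode("ascii")

# The decoded value plays the role of the 'Compose' action's output.
document_text = base64_to_string(sample_file_content)
print(document_text)
```

The decoded string is what the later search steps work on.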

At this stage, your flow is ready to be tested! Add a Microsoft Word file to the SharePoint folder you are monitoring, and then check the run history to confirm that the ‘Compose’ action outputs the document’s text.

Search Text with Power Automate

There are two options for searching text content within Power Automate: the contains() expression and the Encodian ‘Search Text – Regex’ action.

The contains() expression checks whether a specific value exists within a string. The following example checks whether the text obtained from the document contains the word ‘Encodian’:

Expression Reference: contains(outputs('Compose_-_TXT_to_Text'),'Encodian')

This returns a boolean value confirming whether the string is contained; note that contains() is case-sensitive. Alternatively, continuing from step 4, we can use the ‘Search Text – Regex’ action with a regex query that searches for email addresses:

5. Add the Encodian ‘Search Text – Regex’ action

5.a. Text: Select the ‘Outputs’ property provided by the ‘Compose’ action

5.b. Regex Query:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

The ‘Search Text – Regex’ action finds all text matches and returns them as an array.
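To illustrate the "find every match and return an array" behaviour, here is a rough Python sketch. The pattern is a deliberately simplified email expression that I have written for readability, not the full query from step 5.b, and find_emails is a hypothetical helper rather than anything the Encodian action exposes.

```python
import re
from typing import List

# Simplified email pattern for illustration only; NOT the full query from step 5.b.
EMAIL_PATTERN = re.compile(
    r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+",
    re.IGNORECASE,
)


def find_emails(text: str) -> List[str]:
    """Return every email-like match in the text, in order of appearance."""
    return EMAIL_PATTERN.findall(text)


sample_text = "Contact jane.doe@example.com or sales@example.org for details."
matches = find_emails(sample_text)
print(matches)       # the list of matched addresses
print(len(matches))  # the count that a later condition can test
```

The length of the returned list is the value the next part of the flow cares about.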

We can now add logic to our flow to do something based on the result. For this example, I will update the metadata on the source file to indicate whether the document contains sensitive data, such as an email address.

6. Add a ‘Condition’ action, and configure it to check whether the number of matches returned by the Encodian action is greater than 0, i.e. that matches have been found.

7. Inside the ‘If yes’ branch, add the ‘Get file metadata’ SharePoint action:

7.a. Site Address: Configure as per step #2

7.b. File Identifier: Select the ‘x-ms-file-id’ property provided by the ‘When a file is created in a folder’ SharePoint trigger action

8. Add the ‘Update file properties’ SharePoint action

8.a. Site Address: Configure as per step #2

8.b. Library Name: Configure as per step #2

8.c. Id: Select the ‘ItemId’ property provided by the ‘Get file metadata’ SharePoint action
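The decision made in steps 6 to 8 can be sketched in Python as follows. This is only an illustration of the logic: build_metadata_update is a hypothetical helper, and ‘ContainsSensitiveData’ is a made-up column name, so substitute whichever column you actually update in your library.

```python
from typing import Dict, List, Optional


def build_metadata_update(matches: List[str]) -> Optional[Dict[str, bool]]:
    """Mirror the condition: only produce an update when matches were found."""
    if len(matches) > 0:
        # 'If yes' branch: flag the file ("ContainsSensitiveData" is a hypothetical column).
        return {"ContainsSensitiveData": True}
    # 'If no' branch: nothing to update.
    return None


print(build_metadata_update(["jane.doe@example.com"]))
print(build_metadata_update([]))
```

Keeping the update inside the ‘If yes’ branch means files without matches are left untouched.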

The flow is now complete and can be used to check whether sensitive data is contained within Microsoft Word documents!

Finally

We hope you’ve found this guide useful, and as ever, please share any feedback or comments – all are welcome!

You can find further documentation and guidance on the Encodian support portal: Convert Word
