Zonally extract data from documents with Microsoft Flow

By Jay Goodison

16th September 2019

Zonally extracting data sounds daunting, but it doesn’t have to be!

Many of us, over time, will have worked on projects/solutions where there is a requirement to extract data from documents and do something with that data. A typical scenario could be processing a scanned document or documents sent from an external source, which is commonplace in ‘Invoice Processing’ scenarios.

This step-by-step guide details how to configure a Microsoft Power Automate Flow to extract data from a PDF document and add the data as metadata to the current document.

Scenario

The finance department generates invoices using a third-party application which uploads the documents to a SharePoint library for storage. To enable invoice reporting, tracking and related activities, we have a requirement to extract data from each invoice and add metadata to the document. The SharePoint library is configured as follows:

Guide

1. Create a new Flow using the ‘Automated — from blank‘ option

2. Enter a name for the Flow, select the SharePoint ‘When a file is created in a folder’ trigger, and click ‘Create’.

3. Configure the ‘When a file is created in a folder’ trigger, setting the ‘Site Address’ and ‘Folder Id’ fields to the location where documents will be added.

NOTE: For this demo, the documents are already in PDF format. If you need to extract data from a Word document, PowerPoint file, CAD drawing, etc., convert it to PDF first using the Encodian ‘Convert to PDF’ action.

4. Add the Encodian ‘Extract Text Regions’ action

4.b. Filename: Select the ‘File name’ property from the ‘When a file is created in a folder’ trigger

4.c. File Content: Select the ‘File Content’ property from the ‘When a file is created in a folder’ trigger

To continue configuring the ‘Extract Text Regions’ action, we need to provide the coordinates of each piece of data on the source document, i.e. zonal extraction.

So how do we get the coordinates? Easy! Use the ‘Text Region Generator’ utility in the Encodian administration portal.

4.d. Upload a sample PDF document

4.e. Drag and move the area selector to the target area of the document

4.f. Define a name for the region and then click ‘Add to JSON’

4.g. Repeat this process for all target regions of the document.

4.h. Copy the generated JSON data into your clipboard

4.i. Return to Power Automate. On the ‘Extract Text Regions’ action, click the ‘Switch to input entire array’ icon

4.j. Paste the JSON data copied in step 4.h. into the ‘Text Regions’ field
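
The JSON copied from the generator is an array with one object per region, so it can be sanity-checked before it is pasted into the ‘Text Regions’ field. The Python sketch below shows one way to do that. The property names (`Name`, `LowerLeftXCoordinate`, `LowerLeftYCoordinate`, `UpperRightXCoordinate`, `UpperRightYCoordinate`, `PageNumber`) and the sample values are assumptions made for illustration; use whatever the generator actually produces.

```python
import json

# Assumed property names; check them against the JSON the generator produces.
REQUIRED_KEYS = {
    "Name",
    "LowerLeftXCoordinate",
    "LowerLeftYCoordinate",
    "UpperRightXCoordinate",
    "UpperRightYCoordinate",
    "PageNumber",
}


def validate_regions(regions_json: str) -> list:
    """Return a list of problems found in a text-regions JSON array."""
    regions = json.loads(regions_json)
    if not isinstance(regions, list):
        return ["top-level value must be an array of regions"]

    problems = []
    seen_names = set()
    for index, region in enumerate(regions):
        if not isinstance(region, dict):
            problems.append(f"region {index}: not an object")
            continue
        missing = REQUIRED_KEYS - set(region)
        if missing:
            problems.append(f"region {index}: missing {sorted(missing)}")
            continue
        name = region["Name"]
        if name in seen_names:
            problems.append(f"region {index}: duplicate name {name!r}")
        seen_names.add(name)
        # The lower-left corner must sit below and to the left of the upper-right corner.
        if region["LowerLeftXCoordinate"] >= region["UpperRightXCoordinate"]:
            problems.append(f"region {index}: X coordinates are inverted")
        if region["LowerLeftYCoordinate"] >= region["UpperRightYCoordinate"]:
            problems.append(f"region {index}: Y coordinates are inverted")
    return problems


# Hypothetical sample with made-up names and coordinates.
SAMPLE = """[
  {"Name": "InvoiceNumber", "LowerLeftXCoordinate": 400, "LowerLeftYCoordinate": 700,
   "UpperRightXCoordinate": 550, "UpperRightYCoordinate": 720, "PageNumber": 1},
  {"Name": "InvoiceDate", "LowerLeftXCoordinate": 400, "LowerLeftYCoordinate": 670,
   "UpperRightXCoordinate": 550, "UpperRightYCoordinate": 690, "PageNumber": 1}
]"""

print(validate_regions(SAMPLE))
```

An empty list means no problems were found. Unique region names matter because each name is later used to pick out the corresponding extracted value.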

5. We now need a sample of the JSON data the action returns, so that we can add further actions to parse and use it.

5.a. Test the Flow using your preferred method. Click ‘Save & Test’

5.b. For this example, I selected ‘I’ll perform the trigger action’ and then manually uploaded a PDF invoice to the SharePoint library configured in the trigger (step 3).

5.c. Once the Flow has run, open the ‘Extract Text Regions’ action and copy the returned ‘Simple Text Region Results’ JSON.

NOTE: If you have submitted a large file, Flow may display the outputs differently, prompting you to download the output manually. See the example below:

You’ll need to manually download the payload and locate the ‘Simple Text Region Results’ property. You’ll also need to remove any escape characters ‘\’ using either a text/code editor or an online service.
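
Removing the escape characters can also be scripted. The Python sketch below assumes the downloaded value is a JSON string that itself contains JSON, which is why its quotes appear as `\"`. Decoding it once as JSON strips the escaping. The sample value and its field names are invented for illustration.

```python
import json


def unescape_json_string(raw: str) -> str:
    """Decode a quoted JSON string whose content is itself JSON text.

    Returns the inner JSON re-serialised with indentation, ready to paste.
    """
    inner_text = json.loads(raw)  # first pass removes the backslash escaping
    if not isinstance(inner_text, str):
        raise ValueError("expected a quoted JSON string")
    inner_value = json.loads(inner_text)  # second pass confirms it is valid JSON
    return json.dumps(inner_value, indent=2)


# Hypothetical escaped value as it might appear in a downloaded payload.
escaped = r'"{\"InvoiceNumber\": \"INV-001\", \"Total\": \"120.00\"}"'
print(unescape_json_string(escaped))
```

If the value is not wrapped in double quotes, or the inner text is not valid JSON, the function raises an error rather than returning partially cleaned text.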

If you require further guidance on how to parse JSON data, please review the following post: Parsing JSON returned by Encodian Actions

6. Add a ‘Parse JSON’ action

6.a. Content: Select the ‘Simple Text Region Results’ property from the ‘Extract Text Regions’ action

6.b. Click ‘Generate from sample’

6.c. Paste the ‘Simple Text Region Results’ JSON obtained in step 5.c into the text-area control, and click ‘Done’
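
‘Generate from sample’ derives a JSON schema from the pasted sample. The Python sketch below imitates that idea so the resulting schema can be anticipated. It is not Power Automate's actual generator, and the sample field names are hypothetical.

```python
import json


def infer_schema(value):
    """Infer a minimal JSON-schema-like description of a decoded JSON value."""
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {key: infer_schema(item) for key, item in value.items()},
        }
    if isinstance(value, list):
        schema = {"type": "array"}
        if value:
            # Describe the array by its first element only.
            schema["items"] = infer_schema(value[0])
        return schema
    if isinstance(value, bool):  # checked before int because bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if value is None:
        return {"type": "null"}
    return {"type": "string"}


# Hypothetical sample: one property per named region, each holding extracted text.
sample = json.loads('{"InvoiceNumber": "INV-001", "InvoiceDate": "16/09/2019"}')
print(json.dumps(infer_schema(sample), indent=2))
```

Because every extracted value in this sample is text, every property is inferred as a string. Each property name corresponds to a region name defined in step 4.f.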

7. Add a ‘Get file metadata using path’ action

7.a. Site Address: Set as per step 3.

7.b. File Path: Select the ‘File path’ property from the ‘When a file is created in a folder’ trigger.

8. Add an ‘Update File Properties’ action

8.a. Site Address: Set as per step 3.

8.b. Library Name: Set to the library name contained within the ‘Folder Id’ property of step 3.

8.c. Id: Select the ‘ItemId’ property from the ‘Get file metadata using path’ action

8.d. Map data from the ‘Parse JSON’ action to the relevant fields
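
Extracted values arrive as text, so they may need tidying before they suit a date or number column. The Python sketch below illustrates the kind of clean-up involved. The field names, the `dd/mm/yyyy` date format and the currency format are all assumptions for illustration. In a Flow, the equivalent logic would be written with expressions.

```python
from datetime import datetime
from decimal import Decimal


def clean_amount(text: str) -> Decimal:
    """Drop currency symbols and thousands separators, keeping digits, '.' and '-'."""
    kept = "".join(ch for ch in text if ch.isdigit() or ch in ".-")
    return Decimal(kept)


def clean_date(text: str, fmt: str = "%d/%m/%Y") -> str:
    """Parse a date in the assumed invoice format and return it as YYYY-MM-DD."""
    return datetime.strptime(text.strip(), fmt).date().isoformat()


# Hypothetical parsed output; the keys stand in for region names.
parsed = {
    "InvoiceNumber": " INV-001 ",
    "InvoiceDate": "16/09/2019",
    "Total": "£1,250.00",
}

# Hypothetical column names mapped to cleaned values.
metadata = {
    "InvoiceNumber": parsed["InvoiceNumber"].strip(),
    "InvoiceDate": clean_date(parsed["InvoiceDate"]),
    "Total": clean_amount(parsed["Total"]),
}
print(metadata)
```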

9. Test the Flow using data from the previous run

10. Validate that the Flow run executed successfully

11. Validate that the data has been extracted and added to the document as metadata correctly

While this example focuses on extracting document data and setting it as SharePoint document metadata, once the data has been extracted you can do anything with it using Microsoft Power Automate!

We hope you’ve found this guide to zonally extracting data valuable! As ever, please share any feedback or comments. All are welcome!

9 Comments

Data Integration Info says:

June 8, 2020 at 11:49 am

Can we use this software to extract image OCR data from a PDF file? Is that possible?

1. Jay Goodison says:

  June 12, 2020 at 10:33 am

  Hi,
  Yes, please refer to the ‘OCR a PDF document’ action.
  HTH
  Jay

Jack Selby says:

June 18, 2020 at 12:37 pm

Hello, once I have my Parse JSON schema set up, can I use the data collected to auto-fill selected PDF files with the same input criteria?

1. Jay Goodison says:

  June 22, 2020 at 10:16 am

  Hi Jack,
  Yes, please refer to the following post: Fill a PDF Form with Microsoft Power Automate
  Cheers, Jay

Warren Gibbs says:

August 20, 2020 at 4:28 pm

This is a very nice solution! A viable alternative to using Forms Processing in AI Builder.

Dean Birks says:

November 26, 2020 at 9:44 am

This is great. Works really well. Would you be able to then make the flow distinguish different purchase orders by supplier name and have different data extraction regions for each supplier within the same folder?

1. Jay Goodison says:

  December 14, 2020 at 8:43 am

  Hi Dean,
  Yes, but this is not an AI-based extraction solution; you’ll need to extract the data, build the regions manually and configure your flow to manage this. HTH

Jason Davis says:

December 2, 2020 at 5:47 pm

Is there a recommended solution within Encodian to extract data from a PDF that was originally an Excel file where there are multiple rows for a given column?

1. Jay Goodison says:

  December 14, 2020 at 8:46 am

  Hi Jason, you can do this with the ‘Extract Text Regions’ action by creating a region where data might appear… really you should look at Microsoft’s AI solutions for the Power Platform. HTH
