By Jay Goodison

10th October 2019

Automatically OCR PDF Documents added to a SharePoint Library

Want to know how to automatically OCR PDFs as they are added to SharePoint?

The Encodian connector provides an OCR action named ‘OCR a PDF Document’, which checks a PDF document for a text layer. If one isn’t present, the action performs OCR, adds the text layer to the document, and returns the newly OCR’d PDF document.

The ‘OCR a PDF Document’ action can also perform a wide array of clean-up operations such as auto-rotate, deskew, despeckle, etc. Please review the documentation for further details.

Scenario

This article details a simple Flow that automatically performs OCR on PDF documents added to a SharePoint library, so that SharePoint can index the contents of the files and users can find them more easily.

Guide

1. Create a new Flow using the ‘Automated — from blank‘ option

2. Enter a name for the Flow, select the ‘When a file is created in a folder’ SharePoint trigger action, and click ‘Create’

3. Configure the ‘When a file is created in a folder’ SharePoint trigger action

3.a. Site Address: Enter the location of the SharePoint site where the target library / folder is held

3.b. Folder Id: Select the SharePoint folder which will be monitored for new PDF documents

4. Add a ‘Condition’ action

4.a. Configure the condition action as per the image below, which will ensure that the Flow only attempts to apply the OCR action to PDF documents

5. Add a ‘Terminate’ action within the ‘No’ branch of the condition added in step #4

5.a. Status: Select ‘Succeeded’

6. Add an ‘OCR a PDF Document’ action within the ‘Yes’ branch of the condition added in step #4

6.a. Filename: Select the ‘Filename’ property from the ‘When a file is created in a folder’ SharePoint trigger action

6.b. File Content: Select the ‘File Content’ property from the ‘When a file is created in a folder’ SharePoint trigger action

OPTIONAL SETTINGS

Please review and change the following advanced options as required:

Language: Select the preferred language. The default is ‘English’.

Clean Operations: When set to ‘Default’, the OCR action performs a default collection of clean-up operations, including auto-rotate, auto deskew and auto despeckle. To choose a specific set of clean-up operations, select ‘Specific’ and then enable the required operations.
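The branching built in steps 4 to 6 can be sketched in Python as a rough illustration. This is not Encodian or Power Automate code: the `NewFile` class, the `handle_new_file` function and the `ocr` callable are hypothetical stand-ins, and the check on a ‘.pdf’ file name suffix is an assumption, since the actual condition is defined in the Flow designer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NewFile:
    """Hypothetical stand-in for the outputs of the SharePoint trigger."""
    filename: str
    content: bytes


def is_pdf(filename: str) -> bool:
    # Assumed condition: the file name ends with ".pdf", ignoring case.
    return filename.lower().endswith(".pdf")


def handle_new_file(
    file: NewFile, ocr: Callable[[str, bytes], bytes]
) -> Optional[NewFile]:
    """Mirror steps 4 to 6: skip non-PDF files, otherwise run the OCR step."""
    if not is_pdf(file.filename):
        # 'No' branch: the Flow terminates with status 'Succeeded'.
        return None
    # 'Yes' branch: pass Filename and File Content to the OCR step.
    return NewFile(file.filename, ocr(file.filename, file.content))


def fake_ocr(name: str, data: bytes) -> bytes:
    # Placeholder for the real OCR action; it only tags the content.
    return data + b" [text layer]"
```

Calling `handle_new_file(NewFile("notes.docx", b"doc"), fake_ocr)` returns `None`, which corresponds to the ‘Terminate’ branch, while a file named `Scan.PDF` is passed through the placeholder OCR step.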

Guide – Continued

7. Add a SharePoint ‘Create file’ action

7.a. Site Address: Set to the value of the SharePoint site set in step #3.a

7.b. Folder Path: Set to the same value as the ‘Folder Id’ property set in step #3.b

7.c. File Name: Select the ‘Filename’ property from the ‘OCR a PDF Document’ Encodian action

7.d. File Content: Select the ‘File Content’ property from the ‘OCR a PDF Document’ Encodian action

8. The completed flow should follow this construct:

9. Now let’s test the flow!

10. Select ‘I’ll perform the trigger action’ and click ‘Save & Test’

NOTE: You can ignore the recursive event warning, as the Flow is configured to overwrite an existing document, which will not re-fire the event. If you rename the file, thus creating a brand new file, the Flow will re-run. To avoid recursive event triggers, review our post on the Power Automate Community Blog: SharePoint – Managing Recursive Events in Flow

11. Add a PDF document to the SharePoint folder set in step #3.b

12. Validate a text layer has been added to the PDF document

13. Repeat the test with a non-PDF document
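The reasoning in the note on recursive events can be illustrated with a small simulation. This is a toy model only: the `Folder` class and `run_demo` function are hypothetical and do not use any SharePoint API. It simply assumes, as the note states, that the trigger fires for brand new files and not for overwrites.

```python
from typing import Callable, Dict, List


class Folder:
    """Toy stand-in for a SharePoint folder with a 'file created' trigger."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        self.on_created: List[Callable[[str, bytes], None]] = []

    def save(self, name: str, content: bytes) -> None:
        is_new = name not in self.files
        self.files[name] = content
        if is_new:
            # Only brand new file names fire the trigger; overwrites do not.
            for handler in list(self.on_created):
                handler(name, content)


def run_demo() -> List[str]:
    """Return the file names that caused the simulated Flow to run."""
    folder = Folder()
    runs: List[str] = []

    def flow(name: str, content: bytes) -> None:
        runs.append(name)
        if name.lower().endswith(".pdf"):
            # Step 7: write the OCR'd output back under the same name.
            folder.save(name, content + b" [text layer]")

    folder.on_created.append(flow)
    folder.save("scan.pdf", b"%PDF")    # step 11: add a PDF document
    folder.save("notes.docx", b"doc")   # step 13: add a non-PDF document
    return runs
```

In this model the Flow runs once for each uploaded file. Writing the OCR’d PDF back under the same name is an overwrite, so it does not start a second run. Saving the output under a different name would count as a new file and would start another run.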

Finally…

Hopefully this post provides a good guide for ensuring PDF documents in your SharePoint libraries have been correctly OCR’d.

We hope you’ve found this guide useful, and as ever, please share any feedback or comments – all are welcome!


6 Comments

Max says:

April 6, 2020 at 2:27 am

Hi, I’ve just tested this and it works but my PDFs are ending up roughly 10x larger than they started.

Is there a setting for quality anywhere? Or do I have to disable the processing items to keep file size comparable?

1. Jay Goodison says:
  
  April 6, 2020 at 9:10 am
  
  Hi Max,
  
  Yes, this happens when ‘Clean Operations’ are specified. When these are enabled, each page of the PDF document is converted into a 300 DPI image (500 KB to 1 MB in size) before the selected image optimisation operations are applied. On completion, the PDF document is regenerated from the new images, resulting in larger file sizes. To disable this, make sure you have set the ‘Clean Operations’ parameter to ‘None’. The action will then attach the generated text layer to the original document rather than building a new document from the enhanced images.
  
  Hope this helps
  Jay
  
Tita Atang says:

February 2, 2021 at 9:53 am

Hi,

I’ve added a PDF as per the instructions, but unfortunately the PDF is still not searchable after OCR. Is there a setting I need to enable for it to be more sensitive? The PDF I used contains a table with words and numbers. I used the OneDrive Scan option (on my mobile) to generate the original PDF.

1. Jay Goodison says:
  
  February 7, 2021 at 10:14 am
  
  Hi Tita, can you please email your flow configuration and document to support@encodian.com? Typically, we see this reported where a document is provided to the Encodian action, but the output (the OCR’d PDF) hasn’t been used… i.e. you might use the SharePoint ‘Update File’ action to overwrite the source PDF document.
  HTH
  Jay
  
Thai says:

May 31, 2022 at 9:10 pm

Once this flow is built, will I be able to use the search function in SharePoint to search text in the documents, or will I only be able to search on a selected document? Would it make more sense to use a PowerApp on top of SharePoint with Encodian running in the background for this functionality?

1. Jay Goodison says:
  
  June 6, 2022 at 7:06 am
  
  Yes, when you add a text layer to a PDF document through OCR, SharePoint will index the new text layer, so the document will appear in M365 search.
  
