
With its data centers located in Europe, the Microsoft 365 ecosystem could only be considered a viable and acceptable application solution if the consent of the individuals concerned were obtained, as required by the local legal framework (Quebec Law 25, adopted as Bill 64). Given the complexity of that consent process, Microsoft 365 document management applications such as SharePoint will not be part of the application architecture.

From milestone 1 to milestone 2

Initial process for loading scanned documents onto the on-premises servers: the service that retrieves the documents can run a dual loading process, i.e. one copy to the vault server and another to the server used as a springboard for loading into an Azure storage service (a minimal sketch follows the numbered steps below).

[Diagram: image-20240315-154058.png — initial dual-loading process]

1- Scanning is performed from a secure workstation connected to the on-site network, and documents are automatically deposited in a network directory.

2- The service is triggered automatically or manually to retrieve the documents from the network directory.

3- The service stores the documents on the vault server, in a secure network directory.

4- In parallel, the service stores the documents on the server that serves as a springboard for the Azure storage service.

5- Steps 2, 3 and 4 are logged on the same server on which the service is deployed.
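
As a rough illustration of steps 2 to 5, the sketch below shows what a minimal retrieval service could look like in Python: it copies each scanned document both to the vault directory and to the Azure staging directory, and logs each operation locally. All directory paths and file patterns are hypothetical placeholders, not the actual environment configuration.

```python
import logging
import shutil
from pathlib import Path

# Hypothetical directories; the real UNC paths depend on the on-premises environment.
SCAN_INBOX = Path(r"\\scan-server\inbox")            # network directory fed by the scanners (step 1)
VAULT_DIR = Path(r"\\vault-server\documents")        # secure vault directory (step 3)
AZURE_STAGING = Path(r"\\staging-server\to-azure")   # springboard directory for Azure (step 4)

logging.basicConfig(filename="dual_load.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def run_dual_load() -> None:
    """Retrieve scanned documents and perform the dual load (steps 2-4), logging each step (step 5)."""
    for document in SCAN_INBOX.glob("*.pdf"):
        logging.info("Retrieved %s from the scan directory", document.name)
        shutil.copy2(document, VAULT_DIR / document.name)        # copy to the vault server
        logging.info("Stored %s in the vault directory", document.name)
        shutil.copy2(document, AZURE_STAGING / document.name)    # copy to the Azure springboard server
        logging.info("Staged %s for the Azure storage service", document.name)
        document.unlink()  # remove from the inbox once both copies have succeeded

if __name__ == "__main__":
    run_dual_load()
```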

Subsequent process for loading scanned documents into the on-site vault: the service retrieving the documents executes a single loading step, i.e. it deposits the document on the vault server. Another service then reacts to this event by extracting the document and depositing it on the server dedicated to integration with the Azure storage service (a minimal sketch follows the numbered steps below).

[Diagram: image-20240315-175035.png — subsequent event-driven loading process]

1- Scanning is performed from a secure workstation connected to the on-site network, and documents are automatically deposited in a network directory.

2- The service is triggered automatically or manually to retrieve the documents from the network directory.

3- The service stores the documents on the vault server, in a secure network directory.

4- The service logs the processing on the server where it runs.

5- The service deployed on the server bridging to the Azure storage service reacts to this event, extracts the document stored in the vault, and stores it on its own runtime server.

6- Once the process is complete, the service logs the processing.
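
As a rough illustration of steps 5 and 6, the sketch below shows a second service that reacts to documents arriving in the vault and stages them on the server bridging to the Azure storage service. The directory paths and the simple polling approach are illustrative assumptions; the real service could instead react to file-system events.

```python
import logging
import shutil
import time
from pathlib import Path

# Hypothetical directories; the real paths depend on the on-premises environment.
VAULT_DIR = Path(r"\\vault-server\documents")       # secure vault fed by the first service
AZURE_BRIDGE = Path(r"\\bridge-server\to-azure")    # server dedicated to Azure integration

logging.basicConfig(filename="vault_to_bridge.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def watch_vault(poll_seconds: int = 30) -> None:
    """React to new documents in the vault by staging them on the Azure bridge server."""
    seen: set[str] = set()
    while True:
        for document in VAULT_DIR.glob("*.pdf"):
            if document.name in seen:
                continue
            shutil.copy2(document, AZURE_BRIDGE / document.name)  # step 5: extract and stage
            logging.info("Copied %s from the vault to the Azure bridge server", document.name)  # step 6
            seen.add(document.name)
        time.sleep(poll_seconds)

if __name__ == "__main__":
    watch_vault()
```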

From milestone 2 to milestone 3: Storage Architecture

The application architecture refers only to milestone 3 (use of Azure services), in which there are two components to consider:

  • the document/image component: document indexing with a document indexer (Azure AI Search);

  • the video component: video indexing with a video indexer (Azure AI Video Indexer).

From these two components, a dual application architecture emerges, because the AI technologies of the Azure platform offer two knowledge mining approaches:

→ for video, knowledge mining of video content;

→ for documents and images, knowledge mining of documents and images.
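
For the document/image component, the indexing itself is typically driven by an Azure AI Search indexer pointed at the storage container holding the documents. The sketch below, assuming the azure-search-documents Python SDK and a hypothetical indexer name, shows how such an indexer run could be triggered and checked on demand; the video component would follow an analogous flow through the AI Video Indexer APIs.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient

# Hypothetical endpoint, key and indexer name; the real values come from the Azure AI Search resource.
SEARCH_ENDPOINT = "https://<search-service>.search.windows.net"
SEARCH_API_KEY = "<admin-key>"
INDEXER_NAME = "documents-blob-indexer"  # indexer targeting the container of scanned documents/images

client = SearchIndexerClient(SEARCH_ENDPOINT, AzureKeyCredential(SEARCH_API_KEY))

# Launch an on-demand run of the document indexer and report its last known status.
client.run_indexer(INDEXER_NAME)
status = client.get_indexer_status(INDEXER_NAME)
print(f"Indexer '{INDEXER_NAME}' status: {status.status}")
```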


Choosing the data transfer technology to or from Azure

  1. Criteria

  • Need to transfer large amounts of data over a slow Internet connection (where online transfer would be too time-consuming): physical transfer (Azure Data Box).

  • Need to transfer large amounts of data with a moderate or fast Internet connection:

    • if moderate Internet speed: physical transfer (Azure Data Box).

    • if high-speed (1 Gbps or more): AzCopy, Azure Data Box (virtual version), Azure Data Factory, Azure Storage REST APIs (SDKs).

  • Need to orchestrate transfers of any volume: Azure Data Factory.

  • Need to log transfers regardless of volume: Azure Data Factory, Azure Storage REST APIs.

  • Need to transfer small amounts of data with a slow to moderate Internet connection:

    • Without programming: GUI (Azure Storage Explorer Utility), Azure Portal, SFTP Client.

    • With programming: AzCopy/PowerShell, Azure CLI, Azure Storage REST APIs (Azure Functions, Applications).

  • Regular or continuous transfer requirements:

    • Regular interval: AzCopy, Azure Storage APIs.

    • Continuous transfer: Azure Data Factory, Azure Data Box (online transfer).
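
For quick reference, the criteria above can be condensed into a simple decision helper. The sketch below is an illustrative Python function only; the volume threshold and the returned lists are simplifications of the bullets above, not an official decision matrix.

```python
def suggest_transfer_technology(data_volume_tb: float, bandwidth_gbps: float,
                                needs_orchestration: bool = False,
                                needs_logging: bool = False) -> list[str]:
    """Map the criteria above to candidate Azure transfer technologies (illustrative only)."""
    if needs_orchestration:
        return ["Azure Data Factory"]
    if needs_logging:
        return ["Azure Data Factory", "Azure Storage REST APIs"]
    if data_volume_tb >= 10 and bandwidth_gbps < 1:
        # Large volume over a slow or moderate link: physical transfer.
        return ["Azure Data Box"]
    if data_volume_tb >= 10:
        # Large volume over a high-speed link (1 Gbps or more).
        return ["AzCopy", "Azure Data Box (virtual version)", "Azure Data Factory", "Azure Storage REST APIs"]
    # Small volume over a slow to moderate link.
    return ["Azure Storage Explorer", "Azure Portal", "SFTP client", "AzCopy", "Azure CLI"]

print(suggest_transfer_technology(data_volume_tb=25, bandwidth_gbps=0.1))  # ['Azure Data Box']
```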

  2. Limitations and disadvantages (to help with the decision)

Each technology is listed below with its limitations.

SFTP Client

Authentication & Authorization

  • Microsoft Entra ID (Azure Active Directory identity) is not supported for the SFTP endpoint.

  • ACLs (access control lists) are not supported for the SFTP endpoint.

  • Local users are the only supported form of identity management.

Network

  • Port 22 must be open on the network.

  • Static IP addresses are not supported.

  • Internet routing is not supported; Microsoft network routing must be used.

  • 2-minute timeout for idle or inactive connections.

Performance

  • Performance degradation over time; impact of network latency.

Other

  • Constraint on the use of the account hierarchical namespace feature (it must be enabled, which requires Azure Data Lake Storage Gen2 capabilities).

  • Maximum file size is 100 GB.

  • SFTP must be disabled before initiating storage redundancy/replication.

AzCopy Utility

  • Synchronous operation only, no asynchronous copying.

  • Potential timeouts, dependency on the on-premises infrastructure and on our network.

  • Impact of log generation on performance.

  • Unexpected behavior: if file size reaches 200 GB, AzCopy splits the exported file.

  • No traceability, just a technical log.

Azure Data Factory

  • Resource limits exist, but are rarely a concern in practice, as the service is designed for large volumes of data.

  • If pipelines involve a lot of transformation performed outside ADF's native activities, it becomes harder to estimate the final cost of the operation.

  • Long-running operations are more expensive than on-premises solutions.

Azure Function

  • Not suitable for long, resource-intensive runs.

  • Avoiding cold-start latency requires a more expensive hosting plan.

  • Functions are stateless; if data persistence is required, the architecture becomes more complex because data must be fetched across services.

  • Limited debugging capabilities.

Initial/subsequent document loading process

Option A: Azure Data Factory with Copy Data Tool (built-in service)
[Diagram: image-20240315-181701.png — Option A flow]

1- Configure the self-hosted integration runtime with the on-premises server containing the documents.

2- Azure Data Factory invokes its native copy task with the Copy Data Tool.

3 & 4- Azure Data Factory extracts and copies the documents into the Storage Account's Blob Storage.

5- From the Azure Storage Explorer application interface, visual validation is possible.
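
The copy itself is configured in the Data Factory authoring interface with the Copy Data Tool; if the pipeline then needs to be launched or monitored programmatically, a minimal sketch using the azure-mgmt-datafactory Python SDK could look like the following (the subscription, resource group, factory and pipeline names are hypothetical placeholders).

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Hypothetical names; the real values come from the Azure subscription and the Data Factory resource.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-document-ingestion"
FACTORY_NAME = "adf-document-ingestion"
PIPELINE_NAME = "copy-scanned-documents"  # pipeline created with the Copy Data Tool

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Trigger the copy pipeline (steps 2 to 4) and report its run identifier and status.
run = client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, PIPELINE_NAME, parameters={})
pipeline_run = client.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id)
print(f"Pipeline run {run.run_id} status: {pipeline_run.status}")
```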

Option B: Azure Data Factory with Azure Files
[Diagram: image-20240315-183249.png — Option B flow]

1- Configure Azure Files with the on-premises network directory. Any document deposited in the local directory automatically appears in the Azure Files share.

2- Azure Data Factory invokes its native Copy Data Tool task and extracts documents from the network directory.

3- Azure Data Factory with Copy Data Tool stores the documents in the Storage Account's Blob Storage.

4- From the Azure Storage Explorer application interface, visual validation is possible.
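
Beyond the visual check in Azure Storage Explorer, the presence of the synchronized documents in the Azure Files share can also be verified with a short script. The sketch below assumes the azure-storage-file-share Python SDK and hypothetical connection and share names.

```python
from azure.storage.fileshare import ShareClient

# Hypothetical values; the real connection string and share name come from the storage account.
CONNECTION_STRING = "<storage-account-connection-string>"
SHARE_NAME = "scanned-documents"

share = ShareClient.from_connection_string(CONNECTION_STRING, share_name=SHARE_NAME)

# List the items that have appeared in the share from the on-premises network directory (step 1).
for item in share.list_directories_and_files():
    print(item.name)
```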


Option C: Azure Data Factory with Azure Function
[Diagram: image-20240315-191714.png — Option C flow]

1- Azure Data Factory invokes the Azure Function (trigger).

2- The Azure Function uses the self-hosted integration runtime on the on-premises server to communicate with the document server.

3- Azure Function copies the documents.

4- Azure Function stores the documents in the Storage Account's Blob Storage.

5- From the Azure Storage Explorer application interface, visual validation is possible.
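
A minimal sketch of the function body follows, assuming the Python v2 programming model of Azure Functions, that the function can reach the on-premises document share, and that the storage connection string is provided as an application setting; all names and paths are illustrative placeholders.

```python
import os
from pathlib import Path

import azure.functions as func
from azure.storage.blob import BlobServiceClient

app = func.FunctionApp()

# Hypothetical on-premises share reachable by the function, and target blob container.
ONPREM_DOCUMENTS = Path(r"\\document-server\scanned")
CONTAINER_NAME = "scanned-documents"

@app.route(route="copy-documents", auth_level=func.AuthLevel.FUNCTION)
def copy_documents(req: func.HttpRequest) -> func.HttpResponse:
    """Triggered by Azure Data Factory (step 1); copies documents to Blob Storage (steps 2 to 4)."""
    blob_service = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONNECTION_STRING"])
    container = blob_service.get_container_client(CONTAINER_NAME)

    copied = 0
    for document in ONPREM_DOCUMENTS.glob("*.pdf"):
        with document.open("rb") as data:
            container.upload_blob(name=document.name, data=data, overwrite=True)
        copied += 1

    return func.HttpResponse(f"Copied {copied} document(s) to Blob Storage.", status_code=200)
```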

Option D: AzCopy
[Diagram: image-20240315-192015.png — Option D flow]

1- PowerShell invokes the directory load with the azcopy.exe executable.

2- AzCopy extracts the documents from the target directory.

3- AzCopy stores the documents in the Storage Account's Blob Storage.

4- From the Azure Storage Explorer application interface, visual validation is possible.
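
A minimal sketch of step 1 as a script, written in Python for consistency with the other sketches even though the description above mentions PowerShell: it shells out to azcopy.exe, assuming the executable is on the PATH and that the destination URL carries a valid SAS token. The directory and URL are placeholders.

```python
import subprocess

# Hypothetical source directory and destination container URL with a SAS token.
SOURCE_DIR = r"\\document-server\scanned"
DESTINATION_URL = "https://<storage-account>.blob.core.windows.net/scanned-documents?<sas-token>"

# Equivalent to running: azcopy copy "<source>" "<destination>" --recursive
result = subprocess.run(
    ["azcopy", "copy", SOURCE_DIR, DESTINATION_URL, "--recursive"],
    capture_output=True, text=True, check=False,
)
print(result.stdout)
```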
