Running Power BI On-Premises Data Gateway on AWS for $0.12 a day
Today, I’d like to discuss in detail how to deploy a fully functional (and reliable) Power BI Data Gateway, and the difficulties to overcome in doing so. As a Christmas present, you’ll get code snippets and a fully functional GitHub repository to clone with Terraform code to deploy :)
What the h… is Power BI On-Premises Data Gateway and why on earth would you use it?
Why would you even need Power BI when you have the AWS cloud and its extensive range of data (management and analytics service (just to name a few: Glue ETL, S3 storage, Athena in-place querying, Redshift warehousing and Quicksight data visualisation) ?) ?
Hum.. remember how business users love Excel? Well, Power BI is the new Excel :) Its ability to pop a beautiful dashboard(if you’ve been in the software industry, you probably know from experience that no software sells well without a dashboard feature, even though no one will ever use it for real) from both a local xlsx file and a database makes them believe that data management is a piece of cake (sweeping under the rug the hardships of data lineage, cataloging and “minor” issues like scalability, security, governance).
So your app’s data is in the cloud, and someone wants to analyze it using Power BI. Being a good professional, you won’t let them run queries on your production database. Here comes the “Power BI On-Premise Data Gateway”!
- The Gateway essentially proxies your app database for the Power BI service, removing the need for network connectivity between the service (or its users) and the database.
- If you have no control on what users will do in Power BI, I advise you to avoid the Direct Query mode and set up a schedule data refresh in what Power BI names a Dataflow (a blob storage that users will then query).
Here’s what we’ll deploy
The code in the repository at the end of this article deploys the following set up:
The reasons for setting up the Data gateway as an EC2 instance within an AWS AutoScaling Group are the following:
- Cost (and environmental footprint): the ASG is a nice way to schedule the creation and termination of our EC2 instance. In order to perform a scheduled refresh, we just need the instance to boot a couple of minutes before the refresh is triggered and we can kill it right afterwards. In this case, I also use spot request to cut cost by ~30%.. I might skip one refresh if AWS has no resources to spare. An m5a.large (the instance type that matches Microsoft rather greedy requirements), run 3 times a day, 20 minutes every time, will cost me $0.12 a day, that is less than $50 a year.
- Reliability without any maintenance: the instance boots from the same image every time. This way, we’ll never face issues like a full disk, memory leaks or Windows getting slow (yeah, the Gateway doit run on a Windows server). Cattle, not pets!
Since Power BI experts told me the Gateway might need regular updates to follow Power BI’s release cycle, my initial design involved an EC2 Image Builder Pipeline to generate a brand new machine image, combining a. the latest Windows AMI b. the latest version of the Power BI Gateway package, and c. a scripted installation and configuration of the gateway.
Alas! The gateway is unfortunately a rather unfinished product, with major drawbacks when it comes to devops / including it in a continuous integration pipeline :
- Even though Microsoft engineers came up with a Powershell module to automate set up tasks, the provided cmdlets can only create a full gateway cluster, but can’t register an installation of the gateway package as an additional member of an existing cluster. So much for idempotence.
- The gateway app relies on external drivers. The driver for PostgreSQL is npgsql (in a rather old version, 4.0.10) and needs to be installed with a non-default option (install it in the Global Assembly Cache, or GAC) that is unavailable with the unattended / quiet install So much for automation.
For these reasons I had to come to terms with the fact that I’ll have to update the windows image manually.
Deploying the infrastructure
The infrastructure itself is quite simple. In the Autoscaling module :
- We define schedules block that tell when to scal up or down.
- We make sure we procure instances at spot price
- We give the instance a role that enables to connect using Remote Desktop Protocol via AWS Systems Manager
- We make sure that the instance can reach both the database and the Power BI service
Et voilà !
The fully functional Terraform code is available in this repository in my Github account: https://github.com/psantus/powerbi-onpremises-data-gateway.terraform
First, deploy it with AWS’s standard Windows AMI. Then, once you done the initial set-up described below, make a VM Snapshot and tell the ASG to use it for subsequent machine creation.
Initial set-up of the Power BI gateway
After you’ve deployed the standard Windows image, you can follow these steps to install the Power BI Gateway packages and configure them as well as the Power BI service.
- Connect to the machine via AWS Systems Manager
- Install Powershell 7 (yes, the DataGateway module is only compatible with that version) :
msiexec.exe /package https://github.com/PowerShell/PowerShell/releases/download/v7.2.6/PowerShell-7.2.6-win-x64.msi /quiet ADD_EXPLORER_CONTEXT_MENU_OPENPOWERSHELL=1 ADD_FILE_CONTEXT_MENU_RUNPOWERSHELL=1 ENABLE_PSREMOTING=1 REGISTER_MANIFEST=1 USE_MU=1 ENABLE_MU=1 ADD_PATH=1
pwsh Install-Module DataGateway -Force Import-Module DataGateway -Force
# Set secret in a secure string $secureClientSecret = (cat .\Desktop\secret.txt | ConvertTo-SecureString -AsPlainText -Force) # Connect to the PowerBI Service Connect-DataGatewayServiceAccount -ApplicationId $AppId -ClientSecret $secureClientSecret -Tenant $TenantId
Install-DataGateway -AcceptConditions # Restart the service after installation (removes some random errors) net stop PBIEgwService net start PBIEgwService
$GateWayDetails = Add-DataGatewayCluster -GatewayName "My Gateway" -RecoveryKey $secureClientSecret -OverwriteExistingGateway # Get gateways and find the one you just created Get-DataGatewayCluster # Put its detail in a variable $GateWayDetails = Get-DataGatewayCluster -Id "xxxxxx-xxxxxx-xxxxxx-xxxxx"
Add-DataGatewayClusterUser -GatewayClusterId $GateWayDetails.Id -PrincipalObjectId "Azure AD User or Group ID here" -AllowedDataSourceTypes $null -Role Admin
$dsTypes = New-Object 'System.Collections.Generic.List[Microsoft.PowerBI.ServiceContracts.Api.DatasourceType]' $dsTypes.Add([Microsoft.PowerBI.ServiceContracts.Api.DataSourceType]::PostgreSql) Add-DataGatewayClusterUser -GatewayClusterId $GateWayDetails.Id -PrincipalObjectId "Azure AD User or Group ID here" -AllowedDataSourceTypes $dsTypes -Role ConnectionCreator
Resources
The fully functional Terraform code is available in this repository in my Github account: https://github.com/psantus/powerbi-onpremises-data-gateway.terraform
If you found this blog post useful, or are graceful enough to suggest improvements (that’s my first post, so I’m sure there’s room for some!), or have questions, just leave a comment!