The Beginner's Guide to Python in Tableau Prep Builder (Windows).

As a user of Tableau Prep Builder you now have the ability to use Python scripts to manipulate your data. This post will go through the steps needed connect Prep to a script sitting locally on your machine, pass the data to it, and have the script pass it back.  With this connection up and running you will be free to transform your data any way you like. For example, this allows you to add a row ID to a table, or carry out a multi-row formula - both actions not currently available directly within Prep. I have a script on my computer that takes the positions and speeds of the Earth and the Moon today, and uses them to calculate their future positions for the remainder of this year (you can see the results in this dashboard).

The first step if you haven't done so already is to download Python. When going through the installation process I found it helpful to leave the box titled Add Python to Path ticked. On my first attempt I left it unticked, and had a few problems later on when entering instructions in Command Prompt. At this point I also found it helpful to restart my machine.

The next step is to install TabPy, the Tableau Python Server, on your computer. This is done via Command Prompt on Windows machines (if you don't know how to find this, type 'cmd' into the Windows search bar at the bottom left of your screen). With this open, type in the following command:

python -m pip install --upgrade pip

After this, I again restarted my machine, and then in Command Prompt typed in:

pip install tabpy

These were the only two commands I needed to get TabPy up and running, but more detailed instructions are available here if required.

Next we need a Python script to pass our Prep data to. The following is a very basic one I used to test the connection between Prep and TabPy.

import pandas

import numpy

def PlusOne(df):
    df=df.to_numpy()
    df=df+1
    df=pandas.DataFrame(df,columns=['Index'])
    return df

To use this you will need to save it as a .py file somewhere on your machine (use a programme like Notebook ++ to do this easily). One thing to note is that Prep by default gives tables of data to a Python script in the format of a data frame (part of the pandas library). All the above script will do then is take a single-columned data frame from Prep, turn it into a numpy array, add one to all the elements in it, then turn the array back into a data frame, and pass this back to Prep.

Another thing to note is that you will only be able to call one function from the script, so you need to make sure all the work you wish to do is wrapped up in this one master function. This seems to include definitions of all your supplementary functions, although importing libraries can seemingly take place outside the function definition, as above.

Now, in Command Prompt, type in

tabpy

Next, in Tableau Prep go to Help > Settings and Performance > Manage Analytics Extension Connection, and set the window to look like this:

If you have done no further configuration to the server, you should now be able to click Sign In. By default the TabPy server is open on port 9004, and requires no credentials. If you wish to change this, follow the instructions on this page.

For the sake of testing, you will need a basic table of data to work with. I put the following one together in Excel:

Connect to this data in Tableau Prep, add a Clean step, and then a Script step.

Next we need to configure the connection. Under Connection Type, select Tableau Python (TabPy) Server. Then, under File Name, navigate to the Python script you wish to use. In this case, it will be the .py file from above. Then under Function Name we need to tell Prep the name of the function in the script that the data frame will be passed to as an argument. In the case of the simple script up above I named this function PlusOne, so I would write this in here. Then, simply add another Clean step, and we should see the script transform the data.

If all has gone to plan, your connection should now be up and running. By default TabPy has a time-out, so occasionally you may need to restart it via Command Prompt. You can do this by simply retyping

tabpy

Now you are ready to transform your data any way you wish. Note that some further work will need to be done to your script if you wish your function to output a different schema from the original data frame input (for example, if you wish to add an extra column, or change data types etc...). For more details on how to do this, have a look here. An easy way around this though is to put dummy columns into your data before the Script step, and then simply overwrite onto these. For example if you wish to add a row ID, create a calculated field called 'Row ID' containing only zeroes, and have your Python script overwrite these with unique integers.

Author:
Matty Sweet
Powered by The Information Lab
1st Floor, 25 Watling Street, London, EC4M 9BR
Subscribe
to our Newsletter
Get the lastest news about The Data School and application tips
Subscribe now
© 2024 The Information Lab