At the end of week 1, at presentation time, DS10 presented an interesting case.
It was very much in line with the kind of research and non-profit work that I’m interested in: Operation Fistula.
DS10 shared with us how they came across a few hitches.
One of them was language related.
The dataset was in French, but none of them spoke it.
Alteryx, Tableau, and multi-language projects: How do they work?
I always think about international non-profits or research institutions that I follow, from J-PAL to IPA to IRC.
All of them work with global teams speaking different languages and in the field in several countries.
So I started wondering:
- What if workbooks had to be accessible to all team members in multiple languages?
- What if the data collected were in languages like Arabic, Hindi, Urdu, or Chinese?
- Does Alteryx recognize them?
- Are all Alteryx tools compatible?
- What about Tableau?
Want to keep your languages fresh?
One thing’s for sure: I am an avid language learner.
The first four months at the Data School are a deep dive into data analytics, data viz, writing, and studying.
There’s no time to take language courses to resume studying where I had left off.
So, I thought, why not find some datasets in some other languages?
It is one thing to work with English and Romance languages.
It is another thing to deal with Arabic or Chinese.
Could it get tricky?
Let’s find out!
Alteryx’s Find and Replace Tool and the Translator Macro
Conrad Wilson (DS10) did a great job in his post Translating Data In Alteryx With The Find And Replace Tool explaining how to use the Find and Replace tool to translate text in Alteryx.
In week 3, Peter Gamble-Beresford also introduced us to a Translator Macro (here’s a post about it!).
I, however, wanted to explore the software without taking these steps.
A French native speaker wouldn’t have to translate datasets written in French to English and then back to French, right?
My weekend challenge: finding a dataset in Arabic or Chinese
Working with spreadsheets and documents in Arabic and Chinese may get tricky at times.
You type in Arabic and then in English in Word, and things start moving from right to left.
You paste an Arabic URL from the internet, and it gets converted into codes.
And so on.
So, the language for this week is Arabic!
I don’t even remember how I got to the Forbes Middle East website.
However, I found the right dataset for what I had in mind.
It’s from the article “أقوى 100 شركة ناشئة عربية” (The Top 100 Emerging Arab Companies).
I scanned through it and jotted down a list of things to solve on Alteryx.
To-do list: Here are my top 7 challenges.
But first, let’s get some things straight!
- I didn’t solve everything.
- I didn’t use everything that I changed on Alteryx to build my dashboard on Tableau.
It’s learning time, right?
- It was only my second week here.
I might have tackled some of the issues the wrong way.
- In a few weeks, I might get back to this post and have a laugh. I may have figured out a way out of some tricky issues.
Who knows?
Challenge #1: Paste the table from the website into Excel/Google spreadsheets.
When I copied the table from the website into both Excel and a Google spreadsheet, it appeared mirrored.
The last column on the right became the first one on the left.
No issue here.
Rank now appears in the first column.
Another curious thing, which shouldn’t bother me, is that cell F9 is empty, and the overflow from cell G9 runs left, not right.
Who knows whether Alteryx messes things up with right-to-left languages?
I expect to encounter the usual issues of typing English words and then Arabic words in a document and fighting with my keyboard to get the spaces on the right side.
#SPOILER alert: From what I’ve seen, for some things it does. Read on below.
Challenge #2: Extract embedded links in column “Name.”
Will I be able to extract the embedded URLs from that column and turn them into a new column as String values?
I solved and explained this step in a separate post.
It’s loaded with step-by-step guidelines and lots of screenshots!
***Solved! *** Check out my second post: How to extract embedded Hyperlinks with Alteryx.
Challenge #3: Founders’ names in the same cell.
Text to Column tool, right? Yeah, um.
Under Founder, I found multiple founders’ names for each company in the same cell.
My end goal was to split the long strings in each cell and create:
- a column with the founders’ first name,
- a column with the founders’ last name.
It seemed like such an easy win . . .
Inside the cells, names are separated by:
- “و” [waaw], the Arabic “and” (e.g., ماجد أبوخطار ومحمود كيال CEO),
- a comma, the Arabic comma (،)
- Or both, of course!
Will I encounter issues trying to use them as separators in the Text to Column tool?
Text to column tool. Separator = “،”. Done.
Ok, let’s say there are two names separated by the Arabic comma.
I could use “،” as a separator. And it works.
What if separators are both ، and و?
In some cells there are more than two founders, and there could be both ، and و in the same sentence.
و (and) is a conjunction and a tricky one.
It’s not a standalone word and is always attached to the word that follows it.
It should look like this, with no space:
CEO ماجد أبوخطار ومحمود كيال.
Also, Arab names may start with و!
How can you determine whether و is the first letter of a name or a last name, or a conjunction?
Should we merge another dataset of all Arab names that start with و and make sure that Alteryx detects them?
Our formula could be something like:
If after a space a word starts with و and the word with و is not a recognized first name, THEN consider it as a separator and split the text into multiple columns/rows.
What about last names?
Even last names can start with و.
For sure, it would be quite a challenge to detect them. There are quite a lot of last names out there . . .
. . . and translated foreign last names.
Look at these two founders’ names:
كارل نعيم، مارك واشنطن
(Karl Naim, Marc Washington)
There’s a comma and a و , but و is the first letter of a last name! The Arabic translation of “Washington”!
So, even if we could list all possible Arab last names starting with و, we couldn’t possibly include all the ones translated into Arabic from another language.
It’s getting messy out here.
Want to try another formula?
Something like (I still haven’t figured out how it could be done):
If after two words (first name + last name) the next word starts with و THEN split?
Was he Abdulaziz ibn Abdul Rahman ibn Faisal ibn Turki ibn Abdullah ibn Muhammad Al Saud?
Ok, most Arabs don’t have a last name as long as the founder of Saudi Arabia.
However, often Arabic names aren’t just made up of first name + last name.
Some Arabs may have a chain of names that follow a specific naming system and order.
They may contain:
- ism (اسم), the given name/first name and laqab (لقب);
- nisbah (نسبة), which most often appears as a demonym, a word used to denote a person who inhabits or is native to a particular place, e.g., Al-Baghdadi, from Baghdad;
- kunya (كنية): a kind of epithet. Abū (أبو, father) or umm (mother) in a genitive construction, i.e., “father of” or “mother of;”
An example:
- “ ماجد أبو خطار ومحمود كيال CEO” or
- “كارل أبو زيد، زياد جريديني”
We are going on a very deep tangent here.
And we’re not finished! Who’s the CEO?
Look at the examples from a few lines above:
ماجد أبوخطار ومحمود كيال CEO
(Mahmud Kial wa CEO Majid ‘Abukhtar)
The two parts in blue and purple are the two founders’ names separated by و.
What does the “CEO” piece belong to?
ماجد أبوخطار (Majid ‘Abukhtar), right?
Well, let’s look back at the original table on the website.
You read from right to left, so Majid ‘Abukhtar is the “CEO,” not Mahmud Kial.
The Google spreadsheet gets it right.
What about Alteryx?
See for yourself.
It has switched the CEO to the left.
Now CEO comes before Mahmud Kial.
I’m getting confused. Am I missing something?
Did you forget the brothers?
It seems like there’s not going to be an easy end here.
Imagine two founders who are also brothers.
Why repeat the last name, right?
Here they are: Tariq and Andrew, Ali and Mohammed.
طارق واندرو كبريت (Tariq and Andrew Capriet/Capreit)
علي ومحمد أبو الحسن، أحمد المنيس (Ali and Mohammed Abulhasan, Ahmed Menis)
For this, I can’t really find a way out.
Very very last, but not least: Should I write the formulas right to left or left to right?
Let’s get back to what I wrote before. If a formula could say:
If after two words (first name + last name) the next word starts with و THEN split,
Would Alteryx recognize that the text goes from right to left?
Or the first letter of a word in Arabic:
ماجد أبوخطار ومحمود كيال
<======== (read from right to left) would be interpreted as:
The last letter of a word in English
=========> (read from left to right)?
Gheorghie gave me some useful tips today on how the REGEX formula could look like for a similar issue on regexr.com.
And the formula seems to recognize the language! From right to left.
Internationalization: debates in the Alteryx community
In the Alteryx community, under “Designer Ideas,” there are quite a few comments about language-related issues.
Some of the comments mentioned the “internationalization and localisation of Alteryx tools” or “parsing non-English strings.”
There’s still a lot I need to understand about this topic.
For now, I keep this challenge open!
Let’s get some things straight.
We are talking about 100 rows here.
I could have taken the time to manually make the changes that I needed to make.
But, let’s say you’re working on a file with thousands of rows or a million.
I would really like to see what could be solved and what could not.
And if I have addressed something the wrong way or if you have an easy suggestion to share, comments are more than welcome!
Challenge #4, finally! Convert string values with “million” and “thousand dollars” into numeric values.
Look at the highlighted cell: مليون دولار
52.
It is “52 mlywn dwlạr” (let’s not get into romanization debates here!), which means “52 million dollars.”
What do I want to do?
I want to convert all the strings to a number.
The strings appear this way:
- مليون دولار # (# million dollars, e.g., مليون دولار 1) (I can’t manage to type the numbers on the right here on the blog!)
- ألف دولار # (# thousand dollars: e.g., ألف دولار 952)
- مليون دولار (million dollars, with no number)
Values look like:
- 2.4 million and
- 56 million.
This is how I solved it. I hope it’s the right way!
It was also puzzling to see how, as soon as I started a formula, opened a bracket to add a field with an Arabic title, and then kept on typing to complete the formula, this happened:
What should have been on the right (the highlighted part) jumped to the left.
Typing the formula in multiple rows helped get it in the right order.
Challenge #5: Split the list of investors separated by a comma.
Only the top 10 companies had information about their investors.
English names. Text to column tool. Easy.
I used the Summarize tool to group by Investors and count the RecordID.
I could get an idea of how many investors had invested in more than one company.
Challenge #6: Rename fields and clean the data.
Done.
Challenge #7: Column “الدولة” (State) contains only 6 states. Will they be recognized by Tableau?
- United Arab Emirates
- Lebanon
- Saudi Arabia
- Egypt
- Palestinian Territories
- Jordan.
Time to switch to Tableau.
Tableau recognized the states even though countries were written in Arabic.
All but four countries.
I matched the data with the correct locations.
The map showed up.
One more thing: “فلسطين” appeared as “Palestinian Territories.”
The debates around mapping that area lead to a whole other topic too complex to be addressed here.
However, I changed “Palestinian Territories” to “Palestine.”
There might be other iterations. For now, this is what my viz looks like!
On Tableau Public.
Data adapted from an article on Forbes Middle East.