Day 2 was upon us and we were tasked with a more analytical challenge this time.
We were given the job of exploring a Smoking Questionnaire's results and diving into the data to find interesting insights.
To start I looked at the data dictionary to see what fields I would like to use and started coming up with lots of questions and listing out what fields I could use to answer them.
![](https://www.thedataschool.co.uk/content/images/2021/12/image-278.png)
and threw a few sketches together to see how I might want to explore the data for example:
![](https://www.thedataschool.co.uk/content/images/2021/12/image-279.png)
But now it was time to actually get the data.
So I downloaded the data and at this first hurdle is where I started to fall. The data was stored in a dat file which in my understanding can't yet be read by Alteryx. As I didn't know how the file was structured I brought it into a text editor to have a look at it.
![](https://www.thedataschool.co.uk/content/images/2021/12/image-280.png)
There were lots of fields and it was quite difficult to understand but knew that I could bring the data in as a fixed width string which I'm going to show because it was useful and important for later.
![](https://www.thedataschool.co.uk/content/images/2021/12/image-281.png)
I went through many different attempts at how to extract the data. Trying parsing out each space, but not every field had a space, adding a space inbetween those values, but then there were some inconsistencies with the fields, regexing out the data but there was just too many fields!!
That's when it hit me I realised that in the documentation the fields referenced columns and there were 701 columns referenced and at this point I remembered that the length of the strings I had brought in were 701. So this is when I realised that in the documentation the column referenced the place in the string.
At this point I was frustrated and my morale was low but then I decided to do something crazy. I brought in the pdf of the documentation as a csv to get the places the field would be in the string. Then parsed out each character in the dat file and joined them together. IT WORKED.
![](https://www.thedataschool.co.uk/content/images/2021/12/image-282.png)
But to get to this point I had used a lot of my time so I had to quickly get on with the analysis.
![](https://www.thedataschool.co.uk/content/images/2021/12/image-312.png)