This post demonstrates a simple way to perform data analysis and extract useful insights. The example walks through rigorous data visualization, hypothesis testing for a prediction problem, and building regression/classification models to learn something interesting from the data.
I will be using Python because, in my opinion, it offers the most comprehensive toolset for data analysis.
Python modules used for this post include pandas, matplotlib, and plotly.
Dataset description
The dataset used for this post contains the following information:
Raw accelerometer sensor data is collected from the smartphone and smartwatch at a rate of 20Hz. It is collected from 29 test subjects as they perform 6 activities (walking, jogging, ascending stairs, descending stairs, sitting, and standing) over a span of time.
Attribute Information
- subject-id: ID describing the participant
- activity-code: code describing the activity being performed
- timestamp: Unix time (integer)
- x: represents the sensor reading (accelerometer) for the x dimension
- y: represents the sensor reading (accelerometer) for the y dimension
- z: represents the sensor reading (accelerometer) for the z dimension
The provided dataset is a simple text file where each line represents a data point having the above attributes.
Libraries used
So first things first, we will import all the necessary modules. Often this part starts out blank and grows slowly as we feel the need for a specific task and import the related modules.
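The original import cell is not shown here; a minimal version, based on the libraries referenced later in the post, might look like this:

```python
# Core data-handling and plotting libraries used throughout this post.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# plotly is also used later for interactive plots:
# import plotly.express as px
```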
Load the data into memory
First things first, we need to load the data into memory for further processing. This part is easy when the dataset is a single small file, but it can get complex for large datasets: you might have to load the data in chunks (pandas has an option for that) or take samples from the whole dataset to fit it into memory.
I prefer to do this step in an `if-else` block. I keep a boolean variable named `FAST_LOAD`, initially defaulted to `False`, at the top of the code. After some trial and error with loading the data and finalizing the loading section, I save a kind of "snapshot" of the loaded data to disk so that I can reload it quickly, and then set `FAST_LOAD` to `True`. The next time I run that code cell, the data loads almost instantly, no matter how large it was in the first place.
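A sketch of this pattern is below. The snapshot file name and the `parse_fn` parsing callback are placeholders, not the post's actual code:

```python
import os
import pandas as pd

FAST_LOAD = False                  # flip to True once a snapshot exists
SNAPSHOT = "train_snapshot.pkl"    # hypothetical snapshot file name

def load_data(parse_fn, raw_path="train.txt"):
    """Load the dataset, using a pickled snapshot when available."""
    if FAST_LOAD and os.path.exists(SNAPSHOT):
        # Fast path: restore the previously parsed data from disk.
        return pd.read_pickle(SNAPSHOT)
    # Slow path: run the (possibly expensive) parsing, then save a
    # snapshot so the next run can skip it.
    df = parse_fn(raw_path)
    df.to_pickle(SNAPSHOT)
    return df
```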
The `train.txt` file does not seem to be a common CSV file that we could load directly with pandas. CSV files usually end each row with a newline, but this one uses a `;` (semicolon). After reading the data as a string, we can replace each `;` with `\n` and then load it with pandas as a CSV.
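A sketch of that approach follows. The inline `raw` string stands in for the contents of `train.txt`, and the column names are assumptions based on the attribute list above:

```python
import io
import pandas as pd

# Hypothetical sample in the raw format: comma-separated fields,
# each record terminated by a semicolon instead of a newline.
raw = "1,Walking,100,0.1,9.8,0.2;1,Jogging,120,1.5,12.3,0.7;"

# Replace the semicolon record separator with newlines, then parse as CSV.
text = raw.replace(";", "\n")
columns = ["subject_id", "activity", "timestamp", "x", "y", "z"]  # assumed names
df = pd.read_csv(io.StringIO(text), header=None, names=columns)
```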
Now that our dataset is in a well-organized pandas DataFrame, we can extract a lot of information with the help of the pandas library!
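The typical first-look calls would be something like the following (shown on a tiny stand-in frame with the same columns):

```python
import pandas as pd

# Tiny stand-in frame with the same columns as the parsed dataset.
df = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "activity": ["Walking", "Jogging", "Walking"],
    "timestamp": [100, 120, 140],
    "x": [0.1, 1.5, 0.3],
    "y": [9.8, 12.3, 9.6],
    "z": [0.2, 0.7, 0.1],
})

print(df.head())      # first rows of the frame
df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics of the numeric columns
```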
Data Cleanup Before Visualization
We should not jump straight to visualization before making sure the visualization code will work properly. For example, a single null value in the dataset can throw an error when visualizing with matplotlib. Fortunately, plotly handles most of these problems for us. Still, it is always good practice to make sure the data does not contain extremely high or low values or, in some cases, zero values.
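One way to drop rows containing zero sensor readings (an assumption about what the original cell did, shown on a tiny stand-in frame):

```python
import pandas as pd

df = pd.DataFrame({
    "x": [0.1, 0.0, 0.3],
    "y": [9.8, 0.0, 9.6],
    "z": [0.2, 0.0, 0.1],
})

# Keep only rows where none of the sensor columns is exactly zero.
df = df[(df[["x", "y", "z"]] != 0).all(axis=1)].reset_index(drop=True)
```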
All rows containing zeros are gone. Now let us check for null values.
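A per-column null count does the job (again on a small stand-in frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [0.1, 0.2, 0.3],
    "y": [9.8, 9.7, 9.6],
    "z": [0.2, np.nan, 0.1],
})

# Count missing values in each column.
print(df.isnull().sum())
```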
We can see that there are some null values in the `z` column. We can remove them easily. Instead of removing them, there are other options, like replacing them with the mean or median, but in this case the number of nulls is very small, so removing them will not hurt much.
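Dropping them is a one-liner:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [0.1, 0.2, 0.3],
    "y": [9.8, 9.7, 9.6],
    "z": [0.2, np.nan, 0.1],
})

# Drop the few rows that contain null values.
df = df.dropna().reset_index(drop=True)
```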
Data visualization
Let us start by checking the class-imbalance property: whether the data is divided equally among all values of the target column (in this case, the `activity` column).
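`value_counts` on the target column gives the per-class totals (the label counts below are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"activity": ["Walking"] * 5 + ["Jogging"] * 3 + ["Sitting"] * 1})

# Count how many samples each activity label has.
counts = df["activity"].value_counts()
print(counts)
```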
There is a class-imbalance problem in the dataset. Deep-learning-based approaches usually do not work well with class-imbalanced datasets. The easiest way to address the problem is to resample the data while training (though this is often not enough, because it can overfit the model to some classes).
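As one concrete (hypothetical) way to do that sampling, every class can be under-sampled down to the size of the rarest class:

```python
import pandas as pd

df = pd.DataFrame({
    "activity": ["Walking"] * 5 + ["Jogging"] * 3 + ["Sitting"] * 2,
    "x": range(10),
})

# Under-sample every class down to the size of the rarest one.
n_min = df["activity"].value_counts().min()
balanced = (
    df.groupby("activity", group_keys=False)
      .sample(n=n_min, random_state=0)
)
```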
Now, to see the differences in the sparsity of the data, we can create a scatter plot. For 2D and 3D data (this dataset has 3 numeric attributes: x, y, z), we can plot the data directly. But when the data has more than 3 dimensions, we either need to plot a subset of the dimensions or use a dimensionality-reduction method like PCA (Principal Component Analysis).
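The post uses plotly for its interactive plots; a non-interactive matplotlib sketch of the same idea, on synthetic stand-in data for two activities, could be:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the (x, y, z) readings of two activities.
xyz_a = rng.normal(0.0, 1.0, size=(100, 3))
xyz_b = rng.normal(3.0, 1.0, size=(100, 3))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(*xyz_a.T, s=5, label="activity A")
ax.scatter(*xyz_b.T, s=5, label="activity B")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
ax.legend()
fig.savefig("scatter3d.png")
```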
Now, for training a model, we need the data attributes (usually named `X`) and their corresponding labels (usually named `y`).
Time series Data
Because the given data is one very long time series, we can define a window of length `time_len` and sample `time_len` consecutive data points (`x`, `y`, `z`) to form a single window. Although we can sample as many windows as we want, a good rule of thumb is to sample somewhat fewer than the data can supply: with `n` points and `time_len = 20`, you should sample fewer than `n / time_len` windows. Otherwise, there will be too much duplicated data for the model.
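A sketch of that windowing step (the function name is mine, and labeling each window by its first timestep is a simplifying assumption):

```python
import numpy as np
import pandas as pd

def make_windows(df, time_len=20, n_samples=None, seed=0):
    """Sample fixed-length windows of (x, y, z) with a label per window."""
    rng = np.random.default_rng(seed)
    n = len(df)
    if n_samples is None:
        # Sample fewer windows than n / time_len to limit duplicated data.
        n_samples = n // time_len // 2
    xyz = df[["x", "y", "z"]].to_numpy()
    labels = df["activity"].to_numpy()
    # Random window start positions; each window covers time_len points.
    starts = rng.integers(0, n - time_len, size=n_samples)
    X = np.stack([xyz[s:s + time_len] for s in starts])
    # Label each window by the activity at its first timestep.
    y = labels[starts]
    return X, y
```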
We can inspect a sample data point from the `(X, y)` pair.
Now, we can select any window from `X` and plot it as a time-series line plot to get a visual representation of each sample window. The first sample window from each activity label ("Sitting", "Standing", "Walking", "Jogging", "Downstairs", and "Upstairs") is visualized below.
(Figures: one sample window each for Sitting, Standing, Walking, Jogging, Downstairs, and Upstairs.)
Observation from the visualization
One can clearly distinguish the data according to its labels. More strenuous activities (like jogging and going upstairs) are far spikier than easier ones (like sitting and standing). On the other hand, going downstairs and going upstairs produce very similar spikes.
But if you look closely at the diagram for going upstairs, the values of y (visualized in red) are higher than the values of z (visualized in green) most of the time. In the case of going downstairs, the values of z (again in green) are higher than the values of y (in red) most of the time. This difference can be learned by any modern machine-learning classifier.
Conclusion
This is it for the visualization part. In a later post, I will continue with this dataset, training different models to see which works better and why.
The dataset was provided by the North South University ACM SC for the Datathon segment of Technovation 2.0, a computer science festival held on the premises of North South University. My two-member team, KUET MANJARO, became the runner-up of the event among 30+ national teams of up to four members from several universities.
The champion of the first-ever university-level Datathon in Bangladesh was Team HardMax from North South University, with the Runner-Up KUET_Manjaro from KUET.
– Official Website of NSU