This post demonstrates a simple way to perform data analysis and extract useful insights. The example walks through rigorous data visualization, hypothesis testing for a prediction problem, and building regression/classification models to learn something interesting from the data.
I will be using Python because, in my opinion, it offers the most comprehensive toolset for data analysis.
Python modules used for this post include pandas, matplotlib, and plotly.
Dataset description
The dataset used for this post contains the following information:
Raw accelerometer sensor data is collected from the smartphone and smartwatch at a rate of 20Hz. It is collected from 29 test subjects as they perform 6 activities (walking, jogging, ascending stairs, descending stairs, sitting, and standing) over a span of time.
Attribute Information
- subject-id: ID describing the participant
- activity-code: code describing the activity being performed
- timestamp: Unix time (integer)
- x: represents the sensor reading (accelerometer) for the x dimension
- y: represents the sensor reading (accelerometer) for the y dimension
- z: represents the sensor reading (accelerometer) for the z dimension
The provided dataset is a simple text file where each line represents a data point having the above attributes.
Libraries used
So first things first, we will import all the necessary modules. Often this part starts out blank and grows slowly as we feel the need for a specific task and import the related modules.
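The original import cell is not shown here; a minimal version, based on the libraries referenced later in the post, might look like this:

```python
# Core data-handling and plotting libraries used throughout this post.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# plotly is also used later for interactive plots:
# import plotly.express as px
```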
Load the data into memory
First things first, we need to load the data into memory for further processing. This part is easy when the dataset is a single small file, but it can get complex for large datasets: you might have to load the data in chunks (pandas has an option for that) or take samples from the whole dataset to fit it into memory.
I prefer to do this step in an `if-else` block. I keep a boolean variable named `FAST_LOAD`, initially defaulted to `False`, at the top of the code. After some trial and error with loading the data and finalizing the loading section, I save a kind of "snapshot" of the loaded data to disk so that I can reload it quickly, and then set `FAST_LOAD` to `True`. The next time I run that code cell, the data loads almost instantly, no matter how large it was in the first place.
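A sketch of this pattern is below. The snapshot file name and the `parse_fn` parsing callback are placeholders, not the post's actual code:

```python
import os
import pandas as pd

FAST_LOAD = False                  # flip to True once a snapshot exists
SNAPSHOT = "train_snapshot.pkl"    # hypothetical snapshot file name

def load_data(parse_fn, raw_path="train.txt"):
    """Load the dataset, using a pickled snapshot when available."""
    if FAST_LOAD and os.path.exists(SNAPSHOT):
        # Fast path: restore the previously parsed data from disk.
        return pd.read_pickle(SNAPSHOT)
    # Slow path: run the (possibly expensive) parsing, then save a
    # snapshot so the next run can skip it.
    df = parse_fn(raw_path)
    df.to_pickle(SNAPSHOT)
    return df
```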
The `train.txt` file does not seem to be a common CSV file that we could load directly with pandas. CSV files usually end each row with a newline, but this one uses a `;` (semicolon). After reading the data as a string, we can replace each `;` with `\n` and then load it with pandas as a CSV.
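A sketch of that approach follows. The inline `raw` string stands in for the contents of `train.txt`, and the column names are assumptions based on the attribute list above:

```python
import io
import pandas as pd

# Hypothetical sample in the raw format: comma-separated fields,
# each record terminated by a semicolon instead of a newline.
raw = "1,Walking,100,0.1,9.8,0.2;1,Jogging,120,1.5,12.3,0.7;"

# Replace the semicolon record separator with newlines, then parse as CSV.
text = raw.replace(";", "\n")
columns = ["subject_id", "activity", "timestamp", "x", "y", "z"]  # assumed names
df = pd.read_csv(io.StringIO(text), header=None, names=columns)
```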
Now that our dataset is in a well-organized pandas DataFrame, we can extract a lot of information with the help of the pandas library!
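The typical first-look calls would be something like the following (shown on a tiny stand-in frame with the same columns):

```python
import pandas as pd

# Tiny stand-in frame with the same columns as the parsed dataset.
df = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "activity": ["Walking", "Jogging", "Walking"],
    "timestamp": [100, 120, 140],
    "x": [0.1, 1.5, 0.3],
    "y": [9.8, 12.3, 9.6],
    "z": [0.2, 0.7, 0.1],
})

print(df.head())      # first rows of the frame
df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics of the numeric columns
```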
Data Cleanup Before Visualization
We should not jump straight to visualization before making sure the visualization code will work properly. For example, a single null value in the dataset can throw an error when visualizing with matplotlib. Fortunately, plotly handles most of these problems for us. Still, it is always good practice to make sure the data does not contain extremely high or low values or, in some cases, zero values.
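One way to drop rows containing zero sensor readings (an assumption about what the original cell did, shown on a tiny stand-in frame):

```python
import pandas as pd

df = pd.DataFrame({
    "x": [0.1, 0.0, 0.3],
    "y": [9.8, 0.0, 9.6],
    "z": [0.2, 0.0, 0.1],
})

# Keep only rows where none of the sensor columns is exactly zero.
df = df[(df[["x", "y", "z"]] != 0).all(axis=1)].reset_index(drop=True)
```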
All rows containing zeros are gone. Now let us check for null values.
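A per-column null count does the job (again on a small stand-in frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [0.1, 0.2, 0.3],
    "y": [9.8, 9.7, 9.6],
    "z": [0.2, np.nan, 0.1],
})

# Count missing values in each column.
print(df.isnull().sum())
```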
We can see that there are some null values in the `z` column. We can remove them easily. Instead of removing them, there are other options, like replacing them with the mean or median, but in this case the number of nulls is very small, so removing them will not hurt much.
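Dropping them is a one-liner:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [0.1, 0.2, 0.3],
    "y": [9.8, 9.7, 9.6],
    "z": [0.2, np.nan, 0.1],
})

# Drop the few rows that contain null values.
df = df.dropna().reset_index(drop=True)
```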
Data visualization
Let us start by checking the class-imbalance property: whether the data is divided equally among all values of the target column (in this case, the `activity` column).
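`value_counts` on the target column gives the per-class totals (the label counts below are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"activity": ["Walking"] * 5 + ["Jogging"] * 3 + ["Sitting"] * 1})

# Count how many samples each activity label has.
counts = df["activity"].value_counts()
print(counts)
```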
There is a class-imbalance problem in the dataset. Deep-learning-based approaches usually do not work well with class-imbalanced datasets. The easiest way to address the problem is to resample the data while training (though this is often not enough, because it can overfit the model to some classes).
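As one concrete (hypothetical) way to do that sampling, every class can be under-sampled down to the size of the rarest class:

```python
import pandas as pd

df = pd.DataFrame({
    "activity": ["Walking"] * 5 + ["Jogging"] * 3 + ["Sitting"] * 2,
    "x": range(10),
})

# Under-sample every class down to the size of the rarest one.
n_min = df["activity"].value_counts().min()
balanced = (
    df.groupby("activity", group_keys=False)
      .sample(n=n_min, random_state=0)
)
```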
Now, to see the differences in the sparsity of the data, we can create a scatter plot. For 2D and 3D data (this dataset has 3 numeric attributes: x, y, z), we can plot the data directly. But when the data has more than 3 dimensions, we either need to plot a subset of the dimensions or use a dimensionality-reduction method like PCA (Principal Component Analysis).
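The post uses plotly for its interactive plots; a non-interactive matplotlib sketch of the same idea, on synthetic stand-in data for two activities, could be:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the (x, y, z) readings of two activities.
xyz_a = rng.normal(0.0, 1.0, size=(100, 3))
xyz_b = rng.normal(3.0, 1.0, size=(100, 3))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(*xyz_a.T, s=5, label="activity A")
ax.scatter(*xyz_b.T, s=5, label="activity B")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
ax.legend()
fig.savefig("scatter3d.png")
```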
Now, for training a model, we need the data attributes (usually named `X`) and their corresponding labels (usually named `y`).
Time series Data
Because the given data is one very long time series, we can define a window of length `time_len` and sample `time_len` consecutive data points (`x`, `y`, `z`) to form a single window. Although we can sample as many windows as we want, a good rule of thumb is to sample somewhat fewer than the data can supply: with `n` points and `time_len = 20`, you should sample fewer than `n / time_len` windows. Otherwise, there will be too much duplicated data for the model.
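A sketch of that windowing step (the function name is mine, and labeling each window by its first timestep is a simplifying assumption):

```python
import numpy as np
import pandas as pd

def make_windows(df, time_len=20, n_samples=None, seed=0):
    """Sample fixed-length windows of (x, y, z) with a label per window."""
    rng = np.random.default_rng(seed)
    n = len(df)
    if n_samples is None:
        # Sample fewer windows than n / time_len to limit duplicated data.
        n_samples = n // time_len // 2
    xyz = df[["x", "y", "z"]].to_numpy()
    labels = df["activity"].to_numpy()
    # Random window start positions; each window covers time_len points.
    starts = rng.integers(0, n - time_len, size=n_samples)
    X = np.stack([xyz[s:s + time_len] for s in starts])
    # Label each window by the activity at its first timestep.
    y = labels[starts]
    return X, y
```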
We can inspect a sample data point from the `(X, y)` pair.
Now, we can select any window from `X` and plot it as a time-series line plot to get a visual representation of each sample window. The first sample window from each activity label ("Sitting", "Standing", "Walking", "Jogging", "Downstairs", and "Upstairs") is visualized below.
(Figures: one sample window each for Sitting, Standing, Walking, Jogging, Downstairs, and Upstairs.)
Observation from the visualization
One can clearly distinguish the data according to its labels. More strenuous activities (like jogging and going upstairs) are far spikier than easier ones (like sitting and standing). On the other hand, going downstairs and going upstairs produce very similar spikes.
But if you look closely at the diagram for going upstairs, the values of y (visualized in red) are higher than the values of z (visualized in green) most of the time. In the case of going downstairs, the values of z (again in green) are higher than the values of y (in red) most of the time. This difference can be learned by any modern machine-learning classifier.
Conclusion
This is it for the visualization part. In a later post, I will continue with this dataset, training different models to see which works better and why.
The dataset was provided by the North South University ACM SC for the Datathon segment of Technovation 2.0, a computer science festival held on the premises of North South University. My two-member team, KUET MANJARO, became the runner-up of the event among 30+ national teams of up to four members from several universities.
The champion of the first-ever university-level Datathon in Bangladesh was Team HardMax from North South University, with the Runner-Up KUET_Manjaro from KUET.
– Official Website of NSU