Tutorial Click Path

Analyze Click Stream Data

Tutorial Click Path Analysis is a basic lesson on how to analyze user sessions using Datameer from the raw log files.

 

About the Sample Data

Name/DescriptionTypeSize
Clickstream_Data (Weblogs)Raw Apache Log File95,000 records
Customer_ProfileMySQL Database delivered as CSV20 records


The First Dataset

The Clickstream Data is a File Upload using the Apache connector.

Parse using the pattern:  %h %l %u %t "%r" %>s %b  

Schema notes:

  • Exclude: logName
  • Change Column Title: requestTime to timeEnglishFormat 
  • Empty Value Placeholder: (Null, - ) without parenthesis

After the upload was completed, you can open the workbook Clickstream Analysis to view the sample data.

The Clickstream Data was filtered within the sheet Clickstream_Data where (AND):

  • Request contains html
  • Status equals 200 (success)

  • RemoteUser notblank 

The Clickstream Data was partitioned within the sheet Clickstream_Data for:

  • All data during September 15th
  • Data from 00:00 and before 03:00 on September 16th

Cleaning the logs

Duplicated Clickstream_Data and named the sheet Format.

RefColumn nameFormula
#AIP

#Clickstream_Data!remoteHost 

#BUser

Clickstream_Data!remoteUser 

#CTimeStamp

#Clickstream_Data!timeEnglishFormat 

#DFilename

URL_FILE(#Clickstream_Data!request) 

#EStatus

#Clickstream_Data!status 

#FTraffic

#Clickstream_Data!bytesSent/1024 

Defining a session

This sheet uses functions that define a session as a series of events with less than five minutes of time in between. Any events occurring more than five minutes apart are reported in a new session. This is done using GROUPBYGAP. This sheet also begins the process of parsing out a click path including the landing page and the exit page. This data is used in later aggregations.

This next sheet is named SessionStats.

RefColumn nameFormula
#AUser

GROUPBY(#Format!User) 

#BSessionStart

GROUPBYGAP(#Format!Timestamp;5m) 

#CSessionEnd

GROUPMAX(#Format!Timestamp) 

#DClicks

GROUPCOUNT() 

#EPath

GROUPCONCAT(#Format!Filename;#Format!Timestamp)

This ensures that your results are ordered by timestamp. *note this has a 1,000 row limit  

#FMins

(#SessionEnd-#SessionStart)/60000 

#GLandingPage

LISTELEMENT(#Path;0) 

#HExitPage

LISTELEMENT(#Path;SIZE(#Path)-1) 

This is requesting the last item in a list that is the size of itself. 0 is the index reference for the first item, the last item is always the SIZE(#List)-1 

Average user statistics

The next sheet generates some immediate insights by defining the average number of clicks a user gives to your site and what the average amount of time spent on the site is in minutes. This sheet references data generated in the SessionStats sheet.

This sheet is named UserSessionStats.

RefColumn nameFormula
#AUser

GROUPBY(#SessionStats!User) 

#BavgClicks

GROUPAVERAGE(#SessionStats!Clicks) 

#CavgMins

GROUPAVERAGE(#SessionStats!Mins) 


The Second Dataset

Customer Profile is a CSV file upload with no notable schema, quote character is ( “ ).

Joining the data

The data from both datasets (Customer_Profile data and UserSessionStats) are used together for analysis.

Use Join to join the Customer_Profile data source with the worksheet UsersSessionsStats. This join gives you information like email and profile role data to enrich the user session statistics that were created in the previous sheet. This sheet is named Joined.

 

Determining the #1 landing page

This sheet tells you what the number one most popular landing page is by referencing the SessionStats sheet which pulled out the initial page visit with the LISTELEMENT function.

This sheet is named TopLandingPages. 

RefColumn nameFormula
#ALandingPage

GROUPBY(#SessionStats!LandingPage) 

#BCount

GROUPCOUNT()

A descending sort on the Count column limits to the top 10.

Determining the top fallout pages

Figure out what content people are looking at when they leave your site by referencing the ExitPage from the SessionStats sheet.

This sheet is named TopFalloutPages.

RefColumn nameFormula
#AFalloutPage

GROUPBY(#SessionStats!ExitPage) 

#BCount

GROUPCOUNT() 

A descending sort on the Count column limits to the top 10.

Focusing on significant visitors

Using a filter, you can focus on users that traveled three clicks or more. This sheet is named ClickPath3Plus.

From the SessionStats sheet, a simple filter is created in a new sheet where Clicks greater equals 3. 

Counting the most popular paths that significant visitors follow

Here you determine what the click stream is for users who are clicking 3 or more pages throughout the site. You also determine how many distinct users followed the same path using GROUPCOUNTDISTINCT.

This sheet is named TopClickPaths.

RefColumn nameFormula
#A ClickPath

GROUPBY(#ClickPath3Plus!Path+"")

Adding the +”” Turns this LIST into a string- this is necessary to properly populate the infographic 

#BCount GROUPCOUNT() 
#CUsersFollowed 

GROUPCOUNTDISTINCT(#ClickPath3Plus!User) 

descending sort is created on the Count column limiting to the top 10

Click pairs and dwell time

Using similar methods used to create the general session, click pairs shows you the path a user took and the pairs of pages in that path. For example, page A to B, B to C, C to A, A to C, etc. In addition to pair trends, you can use this time series to understand dwell time on each page to answer questions like, "How long do users spend on this page before they go on to the next?". The Format sheet is referenced for clean results. 

This sheet is named ClickPairs.

RefColumn nameFormula
#AUser

GROUPBY(#Format!User) 

#BsessionStart

GROUPBYGAP(#Format!Timestamp;5m) 

#CPath

GROUP_PATH_CHANGES(#Format!Filename)
This function ensures that only unique page clicks are returned, For example, keeps out records of people who are on a contact page that click the link for that same contact page. 

#DFromJSON_ELEMENT(#Path;0) 
#EToJSON_ELEMENT(#Path;1) 
#FDwellTimeSecsDENULLIFY(GROUP_DIFF(#Format!Timestamp))/1000

This function turns nulls to ‘0’s, calculates the difference and turns it from MS to Seconds  


Determining the top page moves

Now that you have click pairs, you can determine what the top page moves are from one page to another with simple grouping and counting; the ClickPairs page is referenced for this analysis.

This sheet is named TopMoves.


RefColumn namesFormula
#AFrom

GROUPBY(#ClickPairs!From) 

#BTo

GROUPBY(#ClickPairs!To)

#CCount

GROUPCOUNT()

Determining notable dwell time

The time series created in ClickPairs gave valuable information about how long users spend on given pages and where they go from those pages. This sheet gives you insights into the areas that users spend the most time. A filter and sort are applied to determine the top pages.

This sheet is named LongestDwellTime.


RefColumn nameFormula
#APage

GROUPBY(#ClickPairs!From) 

#BDwellSeconds

GROUPAVERAGE(#ClickPairs!DwellTimeSecs) 

A simple filter is made where Page doesn't equal external.

A sort on the DwellSeconds column is made limiting to the top 10. 


Building the Infographics

Visitor statistics

Widget typeSheetData1Data2Label
Grouped Bar ChartJoinedavgClicksavgMinsEmail

Top click paths

Widget typeSheetStartEndWeight
Circular NetworkTopMovesFromToCount

Click flow

Widget typeSheetSourceTargetStrength
NetworkTopMovesFromToCount

Top landings

Widget typeSheetData1Data2
TableTopLandingPageLandingPageCount

Click paths

Widget type



TableTopClickPathsClickPathCountUsersFollowed

Stickiness

Widget typeSheetData1Data2
TableLongestDwellTimePageDwellSeconds