Introduction:

Hello! In this project, I’m breaking down the top content creators on YouTube over 2017–2018. The first thing I’ll do is load the libraries I’m going to use:

import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import datetime as dt
import json 
from itables import show
%matplotlib inline

Next, I’m reading the data in from a .csv file:

ytube = pd.read_csv('C:/Users/bruce/Downloads/archive (1)/USvideos.csv')

Here’s a preview of what the data looks like:

print(ytube.iloc[0:5])
          video_id                                              title  \
3813   NDBwsMIXlMk  Marvel’s Agents of S.H.I.E.L.D. | Season 5 Pre...   
3868   NDBwsMIXlMk  Marvel’s Agents of S.H.I.E.L.D. | Season 5 Pre...   
3928   NDBwsMIXlMk  Marvel’s Agents of S.H.I.E.L.D. | Season 5 Pre...   
3953   6ZfuNTqbHE8  Marvel Studios' Avengers: Infinity War Officia...   
3954   KXHrCBkIxQQ  ALL THE MONEY IN THE WORLD - Official Trailer ...   
...            ...                                                ...   
39536  X2BfTXIwABo                  MARVEL COMICS 2018: A Fresh Start   
39541  X2BfTXIwABo                  MARVEL COMICS 2018: A Fresh Start   
39546  X2BfTXIwABo                  MARVEL COMICS 2018: A Fresh Start   
39551  X2BfTXIwABo                  MARVEL COMICS 2018: A Fresh Start   
39556  X2BfTXIwABo                  MARVEL COMICS 2018: A Fresh Start   

                     channel_title  category_id    category_title  \
3813          Marvel Entertainment           24     Entertainment   
3868          Marvel Entertainment           24     Entertainment   
3928          Marvel Entertainment           24     Entertainment   
3953          Marvel Entertainment           24     Entertainment   
3954   Sony Pictures Entertainment           24     Entertainment   
...                            ...          ...               ...   
39536         Marvel Entertainment            2  Autos & Vehicles   
39541         Marvel Entertainment            2  Autos & Vehicles   
39546         Marvel Entertainment            2  Autos & Vehicles   
39551         Marvel Entertainment            2  Autos & Vehicles   
39556         Marvel Entertainment            2  Autos & Vehicles   

             publish_time trending_date     views    likes  dislikes  \
3813  2017-11-26 16:00:21    2017-11-27    248455     7174       200   
3868  2017-11-26 16:00:21    2017-11-28    456193     9501       294   
3928  2017-11-26 16:00:21    2017-11-29    537443    10577       341   
3953  2017-11-29 13:26:24    2017-11-30  37736281  1735895     21969   
3954  2017-11-29 20:00:53    2017-11-30    197900     2394       372   
...                   ...           ...       ...      ...       ...   
39536 2018-02-20 15:00:05    2018-02-24    377995     9315       704   
39541 2018-02-20 15:00:05    2018-02-25    385064     9404       712   
39546 2018-02-20 15:00:05    2018-02-26    390321     9485       721   
39551 2018-02-20 15:00:05    2018-02-27    394864     9560       728   
39556 2018-02-20 15:00:05    2018-02-28    400237     9618       732   

       comment_count  comments_disabled  ratings_disabled  err_removed  
3813            1291              False             False        False  
3868            1678              False             False        False  
3928            1971              False             False        False  
3953          241237              False             False        False  
3954             432              False             False        False  
...              ...                ...               ...          ...  
39536           2367              False             False        False  
39541           2391              False             False        False  
39546           2419              False             False        False  
39551           2436              False             False        False  
39556           2447              False             False        False  

[640 rows x 14 columns]

Now I’m going to load the categories from the JSON file, flatten the nested records, and convert them into a DataFrame:

with open('C:/Users/bruce/Downloads/archive (1)/US_category_id.json', 'r') as f:
    ytube_cat = json.load(f)

# Flatten the nested 'items' records, keeping only the id and the category title
ytube_cat = (pd.json_normalize(ytube_cat, record_path='items')
                .loc[:, ['id', 'snippet.title']]
                .rename(columns={'id': 'category_id', 'snippet.title': 'category_title'})
            )

print(ytube_cat)
   category_id         category_title
0            1       Film & Animation
1            2       Autos & Vehicles
2           10                  Music
3           15         Pets & Animals
4           17                 Sports
5           18           Short Movies
6           19        Travel & Events
7           20                 Gaming
8           21          Videoblogging
9           22         People & Blogs
10          23                 Comedy
11          24          Entertainment
12          25        News & Politics
13          26          Howto & Style
14          27              Education
15          28   Science & Technology
16          29  Nonprofits & Activism
17          30                 Movies
18          31        Anime/Animation
19          32       Action/Adventure
20          33               Classics
21          34                 Comedy
22          35            Documentary
23          36                  Drama
24          37                 Family
25          38                Foreign
26          39                 Horror
27          40         Sci-Fi/Fantasy
28          41               Thriller
29          42                 Shorts
30          43                  Shows
31          44               Trailers

The merge I want to do on these DataFrames isn’t working because the key columns’ data types don’t match, so let’s see if we can change that…
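For context, here’s a minimal sketch of the failure a mismatched merge key produces (using hypothetical toy frames, and assuming a reasonably recent pandas version; older versions may behave differently):

import pandas as pd

# Hypothetical toy frames: an int64 key on one side, string keys on the other
left = pd.DataFrame({'category_id': [1, 2], 'views': [100, 200]})
right = pd.DataFrame({'category_id': ['1', '2'], 'category_title': ['Film & Animation', 'Autos & Vehicles']})

try:
    pd.merge(left, right, on='category_id')
except ValueError as err:
    # Recent pandas versions refuse to merge an int64 key against an object key
    print(err)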

ytube.dtypes
video_id                  object
trending_date             object
title                     object
channel_title             object
category_id                int64
publish_time              object
tags                      object
views                      int64
likes                      int64
dislikes                   int64
comment_count              int64
thumbnail_link            object
comments_disabled           bool
ratings_disabled            bool
video_error_or_removed      bool
description               object
dtype: object


ytube_cat.dtypes
category_id       object
category_title    object
dtype: object
# Cast category_id to int64 so it matches the key column in the main frame
ytube_cat['category_id'] = ytube_cat['category_id'].astype('int64')

ytube_cat.dtypes
category_id        int64
category_title    object
dtype: object

I could keep the text-based columns (description and tags) if I wanted to do a text analysis; maybe I’ll explore that another time…
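As a rough illustration of what that could look like, here’s a minimal sketch of a tag-frequency count. It assumes the raw 'tags' column uses the pipe-separated, quoted format this dataset appears to ship with (e.g. "marvel"|"trailer", with [none] for untagged videos); treat it as a starting point rather than a polished analysis.

from collections import Counter

import pandas as pd

# Re-read only the 'tags' column from the raw file
raw_tags = pd.read_csv('C:/Users/bruce/Downloads/archive (1)/USvideos.csv', usecols=['tags'])['tags']

# Split each pipe-separated tag string, strip the surrounding quotes, and count
tag_counts = Counter(
    tag.strip('"').lower()
    for tags in raw_tags.dropna()
    if tags != '[none]'
    for tag in tags.split('|')
)

print(tag_counts.most_common(10))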

# Trimming the fat

# Merge on the shared 'category_id' key, shorten one column name, and keep only the columns I need
ytube = (pd.merge(ytube, ytube_cat, on='category_id')
           .rename(columns={'video_error_or_removed': 'err_removed'})
           .loc[:, ['video_id', 'title', 'channel_title', 'category_id', 'category_title', 'publish_time', 'trending_date',
                    'views', 'likes', 'dislikes', 'comment_count', 'comments_disabled', 'ratings_disabled', 'err_removed']]
        )

print(ytube.iloc[0:5])
      video_id                                              title  \
0  NDBwsMIXlMk  Marvel’s Agents of S.H.I.E.L.D. | Season 5 Pre...   
1  NDBwsMIXlMk  Marvel’s Agents of S.H.I.E.L.D. | Season 5 Pre...   
2  NDBwsMIXlMk  Marvel’s Agents of S.H.I.E.L.D. | Season 5 Pre...   
3  6ZfuNTqbHE8  Marvel Studios' Avengers: Infinity War Officia...   
4  KXHrCBkIxQQ  ALL THE MONEY IN THE WORLD - Official Trailer ...   

                 channel_title  category_id category_title  \
0         Marvel Entertainment           24  Entertainment   
1         Marvel Entertainment           24  Entertainment   
2         Marvel Entertainment           24  Entertainment   
3         Marvel Entertainment           24  Entertainment   
4  Sony Pictures Entertainment           24  Entertainment   

         publish_time trending_date     views    likes  dislikes  \
0 2017-11-26 16:00:21    2017-11-27    248455     7174       200   
1 2017-11-26 16:00:21    2017-11-28    456193     9501       294   
2 2017-11-26 16:00:21    2017-11-29    537443    10577       341   
3 2017-11-29 13:26:24    2017-11-30  37736281  1735895     21969   
4 2017-11-29 20:00:53    2017-11-30    197900     2394       372   

   comment_count  comments_disabled  ratings_disabled  err_removed  
0           1291              False             False        False  
1           1678              False             False        False  
2           1971              False             False        False  
3         241237              False             False        False  
4            432              False             False        False  
# Parse trending_date (stored as yy.dd.mm in this dataset) into a proper datetime
ytube['trending_date'] = pd.to_datetime(ytube['trending_date'], format='%y.%d.%m')
print(ytube['trending_date'])
0       2017-11-14
1       2017-11-14
2       2017-11-14
3       2017-11-14
4       2017-11-14
           ...    
40944   2018-06-02
40945   2018-06-03
40946   2018-06-04
40947   2018-06-05
40948   2018-06-06
Name: trending_date, Length: 40949, dtype: datetime64[ns]

Now that everything is formatted and cleaned, I’m going to plot only the data I need.

# Specify the plot style
plt.style.use('default')


# Convert 'publish_time' column to datetime
ytube['publish_time'] = (pd.to_datetime(ytube['publish_time'])
                           .dt.tz_localize(None)
                        )

# Keep only rows published in 2017 or 2018
ytube = ytube[(ytube['publish_time'].dt.year >= 2017) & (ytube['publish_time'].dt.year <= 2018)]

# Group the data by 'category_title' and publish month, then sum up the 'views'
ytube_grouped = ytube.groupby([ytube['category_title'], ytube['publish_time'].dt.to_period('M')])['views'].sum().reset_index()

# Pivot the data to have 'category_title' as columns and dates as index
ytube_pivoted = ytube_grouped.pivot(index='publish_time', columns='category_title', values='views')

# Plot the time series of each category
ax = ytube_pivoted.plot(kind='line', figsize=(15, 10), colormap = 'tab20')

plt.title('YouTube Views By Category (2017-2018)')
plt.xlabel('Publish Time')
plt.ylabel('Views')


# Change legend title: 
legend = ax.legend()
legend.set_title('Category Title')

plt.show()

[figure: YouTube Views By Category (2017-2018) — line chart]

# Get the top 10 channels by total views
top_channels = ytube.groupby('channel_title')['views'].sum().nlargest(10).index

# Filter to the top 10 channels, keeping the full dataset intact in `ytube`
ytube_top10 = ytube[ytube['channel_title'].isin(top_channels)]

# Group the filtered data by 'channel_title' and publish month, then sum up the 'views'
ytube_grouped = ytube_top10.groupby([ytube_top10['channel_title'], ytube_top10['publish_time'].dt.to_period('M')])['views'].sum().reset_index()

# Pivot the data to have 'channel_title' as columns and dates as index
ytube_pivoted = ytube_grouped.pivot(index='publish_time', columns='channel_title', values='views')

# Plot the time series of each channel
ax = ytube_pivoted.plot(kind='bar', figsize=(15, 10))

# Change legend title: 
legend = ax.legend()
legend.set_title('Channel Title')

plt.title('YouTube Views By Channel (Top 10 channels, 2017-2018)')
plt.xlabel('Publish Time')
plt.ylabel('Views')
plt.show()

[figure: YouTube Views By Channel (Top 10 channels, 2017-2018) — bar chart]

# Get the top 20 channels by total views
top_channels = (ytube.groupby('channel_title')['views']
                    .sum()
                    .nlargest(20)
                    .index)

# Filter to the top 20 channels, again leaving `ytube` itself untouched
ytube_top20 = ytube[ytube['channel_title'].isin(top_channels)]

# Group the filtered data by 'channel_title' and 'publish_time', then sum up the 'likes'
ytube_grouped = (ytube_top20.groupby(['channel_title', 'publish_time'])['likes']
                            .sum()
                            .reset_index())

# Pivot the data to have 'channel_title' as columns and dates as index
ytube_pivoted = ytube_grouped.pivot(index='publish_time', columns='channel_title', values='likes')

# Create a scatter plot for each channel
fig, ax = plt.subplots(figsize=(15, 10))

# Specify colormap
cmap = plt.get_cmap('tab20')  # Colormap with 20 different colors
colors = cmap(np.linspace(0, 1, len(ytube_pivoted.columns)))

for i, column in enumerate(ytube_pivoted):
    ax.scatter(ytube_pivoted.index, ytube_pivoted[column], label=column, color=colors[i])

    # Find the maximum value for each channel and label it
    max_val_index = ytube_pivoted[column].idxmax()
    max_val = ytube_pivoted.loc[max_val_index, column]

    ax.text(max_val_index, max_val, f'{max_val:.0f}', fontsize=8)

    
plt.title('YouTube Likes Over Time for the Top 20 Channels (2017-2018)')
plt.xlabel('Time Period')
plt.ylabel('Likes')


legend = ax.legend(frameon = False)
legend.set_title("Channel Title")

plt.show()

[figure: YouTube Likes Over Time for the Top 20 Channels (2017-2018) — scatter plot]