Analyzing Ad Fraud in YouTube Videos

Abdur Rehman
8 min read · Dec 6, 2019
Photo by Christian Wiediger on Unsplash

What is Ad Fraud?
The internet can be a manipulative place, and the data set we have explores how people can manipulate us using ad fraud. So what exactly is ad fraud? The crux of the matter is that people on YouTube upload videos promoting ways to make money and gain fame. Ad fraud is when those people include links that are in fact malicious and harmful to us. Often they ask for our credit card information and then use that data against us.

Task
Our task is to analyze those websites. We focused only on links to websites and ignored all app-related links. However, most of the links we analyzed had already been removed for being fraudulent, so we had to change the question a little. We analyzed click fraud by going through the descriptions and checking which links are down. Our basic analysis checks whether the websites are harmful, i.e. contain malware, or whether they require a user to enter sensitive information, for example credit card details. We also performed analysis based on patterns, for example common words in fraudulent ads versus benign ads.

Data Introduction
Our data set consists of 1,065 different YouTube videos and a label indicating whether the websites attached to each are fraudulent or benign. To be exact, there are a total of 5 features, which are discussed below:
1) video_Id: This is the unique identifier of each YouTube video. It distinguishes a video from all other videos.
2) video_url: This column contains the link to each video. The initial part is the same for all; the only difference is the unique video_Id.
3) description: This contains the description of each video as written by the person who uploaded it.
4) links: This contains all the links mentioned in the description. Our main focus is on these links.
5) classification: There are two classes, Fraud (f) and Benign (b). Fraudulent videos point to malicious activity, whereas benign ones are harmless.
A snapshot of the data is attached below:

Data Preparation
The data set we received was relatively clean but still needed some adjustments. We checked for rows that had no description or an empty list of links, and found that whenever the description was missing, the link list was empty too. We decided to drop all rows without a description, because viewers usually rely on the description for such videos, and videos without one tend to be ignored. For rows where only the links were missing, we extracted them from the description:

Use of regex to extract links
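The original code was shared as a screenshot, but a minimal sketch of the idea, assuming a pandas DataFrame with description and links columns (the file name and column names here are hypothetical), could look like this:

```python
import re
import pandas as pd

# Match http/https URLs in free text
URL_PATTERN = re.compile(r'https?://[^\s)>\]]+')

def extract_links(description: str) -> list:
    """Return every http/https URL found in a video description."""
    if not isinstance(description, str):
        return []
    return URL_PATTERN.findall(description)

df = pd.read_csv("youtube_ad_fraud.csv")  # hypothetical file name
# Fill the link lists from the description text
df["links"] = df["description"].apply(extract_links)
```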

Next, we looked at whether the links mentioned in the descriptions still work. For that purpose, we ran a piece of code to check each one:

This code parses the URL, and the get_alive_dead function returns the HTTP status of the URL, which is compared against 200 to check whether the link is working or not.
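The code itself appears as an image in the original post; a rough sketch of such a check using the requests library (the get_alive_dead name comes from the article, the rest is an assumption) might be:

```python
import requests

def get_alive_dead(url: str, timeout: int = 5) -> int:
    """Return the HTTP status code of the URL, or 0 if the request fails."""
    try:
        return requests.get(url, timeout=timeout, allow_redirects=True).status_code
    except requests.RequestException:
        # Dead links, timeouts and malformed URLs all count as not working
        return 0

# A link is considered working only if it answers with HTTP 200
df["w_links"] = df["links"].apply(
    lambda urls: [u for u in urls if get_alive_dead(u) == 200])
```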

Through this, we found out that out of the 6,228 links mentioned, only 1,444 links are working.

We added another column to our data set known as w_links, which contains the working links mentioned in the description. The updated data set is attached below:

Snapshot of updated data set

As we can see, there are a lot of empty lists in the working links column. Furthermore, we removed the links from the description column using regex, as they would interfere with our model.

Next, we added another column, 'closed_links'. For every video, this equals the total number of links minus the number of working links.
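A one-line sketch of that computation, under the same hypothetical column names as above:

```python
# closed_links = all links found in the description minus the ones that still respond
df["closed_links"] = df["links"].apply(len) - df["w_links"].apply(len)
```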

After doing all these steps we now have a total of 7 columns:
1) video_Id
2) video_url
3) description
4) links
5) classification
6) working_links (the list of links that responded; an added column)
7) closed_links (total links minus working links; an added column)

Now we can go on to do our analysis.

Exploratory Data Analysis

Since the data was mostly categorical and non-numeric, there wasn't much EDA to do beyond looking at how many videos were benign and how many were fraudulent.

Classification with regards to working links

As we can see in the bar chart, most of the videos were, in fact, harmless, so our model will check whether a video promotes ad fraud based on its description.

Feature Engineering
Initially our features consisted of all the unique words, and we tested our model on those. However, that resulted in a massive gap between the training accuracy and the test accuracy, which we believe was due to over-fitting: our features included every word used in the descriptions, and some of them interfered with our model.

To improve our model we used a built-in function known as SelectKBest. It takes two parameters: the score function (we used the chi-squared test) and the number of features we want it to return. For example, if we choose 940, the function returns the 940 best features.

So you're probably wondering why we chose exactly 940, why not more, why not fewer. The answer is that we used trial and error to determine the optimal number of features. We started with 100 and moved up, with the goal of improving accuracy while making sure the model did not over-fit. After several runs, the optimal number of features fell between 900 and 950, so we settled on 940. This also helps reduce training time, since the number of features drops from roughly eleven thousand to nine hundred and forty.
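As a sketch, assuming X is the document-term matrix and y the 0/1 labels, selecting the 940 best features with scikit-learn could look like:

```python
from sklearn.feature_selection import SelectKBest, chi2

# chi2 scores each word against the label;
# k=940 keeps only the 940 highest-scoring words
selector = SelectKBest(score_func=chi2, k=940)
X_selected = selector.fit_transform(X, y)
```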

Machine Learning:

Model Training:
In this step, we use our data to incrementally improve our model's ability to predict whether a link in the description of a YouTube video is fraudulent or benign. We defined two things that would be useful in predicting which links are fraudulent and which are benign:
i) the description
ii) the target variable, taking the value 1 for fraudulent websites and 0 otherwise

Model Selection:

First of all, the data was split into two parts with a 70:30 train-test split. We used logistic regression, which is appropriate when the dependent variable is categorical, as is the case with our predictions. Next, we used 5-fold cross validation to evaluate model performance on the training set.

As before, we used a TF-IDF vectorizer, so the features are the words in the description.
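Putting the pieces together, a rough sketch of this first model (the column names, the 'f'/'b' label encoding, and the random seed are assumptions) might look like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Target: 1 for fraudulent, 0 for benign
y = (df["classification"] == "f").astype(int)

# 70:30 train-test split
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df["description"], y, test_size=0.30, random_state=42)

# Features are the words of the description, weighted with TF-IDF
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

model = LogisticRegression(max_iter=1000)

# 5-fold cross validation on the training set
print("CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())

model.fit(X_train, y_train)
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))
```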

Results:

The accuracy score on test data is: 0.8242811501597445
The accuracy score on train data is: 0.9931318681318682

When the accuracy score on test data is much lower than on train data, it implies that our model is over-fitting. To fix this, we built another model with the following characteristics:
i) We selected the best 940 features (using SelectKBest).
ii) We also added the number of closed links as another feature. The idea behind this feature is that many fraudulent links no longer work or have been taken down, while many benign links are still up. Since our data set is outdated, this assumption makes sense (a sketch of this second model follows below).
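A sketch of how this second model could be assembled, reusing the TF-IDF matrices from above and assuming closed_train and closed_test hold the closed-link counts for each split:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

# Keep the 940 most informative words from the TF-IDF matrices
selector = SelectKBest(score_func=chi2, k=940)
X_train_k = selector.fit_transform(X_train, y_train)
X_test_k = selector.transform(X_test)

# Append the closed-link count of each video as one extra column
X_train_2 = hstack([X_train_k, np.asarray(closed_train).reshape(-1, 1)])
X_test_2 = hstack([X_test_k, np.asarray(closed_test).reshape(-1, 1)])

model2 = LogisticRegression(max_iter=1000)
model2.fit(X_train_2, y_train)
print("Train accuracy:", model2.score(X_train_2, y_train))
print("Test accuracy:", model2.score(X_test_2, y_test))
```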

Why did we use TF-IDF vectorizer over Count Vectorizer?

Firstly, TF-IDF is better because instead of returning a raw count, it returns a weight between 0 and 1 for every feature. Secondly, it compares how often a word appears in a given document with how many documents it appears in across the whole corpus: words that show up in many documents are penalized. A good example is stopwords like "an" and "the", which appear a lot in text but give no insight about it. This helps us train our model better.
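A tiny comparison on toy documents illustrates the difference:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["make money fast with the link below",
        "the official trailer for the new movie",
        "the best free giveaway link"]

counts = CountVectorizer().fit_transform(docs)  # raw integer counts per word
tfidf = TfidfVectorizer().fit_transform(docs)   # weights between 0 and 1

# A word like "the", which shows up in every document, ends up with a low
# TF-IDF weight, while rarer, more informative words keep higher weights.
```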

Results:
The accuracy with the number of closed website links for the Training set is 0.8447802197802198
The accuracy with the number of closed website links for the Test set is 0.8210862619808307

Even though these accuracies are lower than model 1's, this is a good model because it is neither over-fitting nor under-fitting. The bar chart below further illustrates how large the gap in accuracy is when the closed-links feature is not added to the model.

Findings from model:
Confusion matrix for prediction of fraudulent links using closed links count and without using closed links count.

As shown in the confusion matrix above, there were actually 620 benign links; our second model correctly predicted 581 of them as benign and misclassified the remaining 39 as fraudulent. The same reading applies to the fraudulent links.
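For reference, such a confusion matrix can be produced with scikit-learn (continuing the hypothetical names from the sketches above):

```python
from sklearn.metrics import confusion_matrix

# Rows are the actual classes, columns the predicted ones (0 = benign, 1 = fraudulent)
y_pred = model2.predict(X_test_2)
print(confusion_matrix(y_test, y_pred))
```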

Conclusion:
The goal of this project was to evaluate which features of a YouTube video can be used to predict whether the links in its description are benign or fraudulent. There are many features we could have added, e.g. comments, subtitles, etc.; however, since our data set was outdated, the majority of the links were already dead (5,000 out of 6,500). We cleaned our data to process it according to our needs, and through exploratory data analysis and machine learning we built a model that was a good fit and could efficiently predict which links were fraudulent and which were benign. Instead of determining which websites are actually malicious, we extracted the links that were closed and used that count as an extra feature in our model. Secondly, by applying TF-IDF to the descriptions and then SelectKBest with a chi-squared score function to the vectorized features, we obtained 940 features for our model, which we believe is efficient and a good predictor of which links are fraudulent and which are benign.

Looking Ahead:
We generated a good model, but we faced a number of problems that could be fixed in the future. The data set was outdated, which is why many of the links and YouTube videos were either closed or deleted. Moreover, the data was not at all comprehensive: there are many other features that should be taken into account as well. To generate a good model in the future, we need an updated data set that is also comprehensive.

Contributors:

Bilal Naeem (2100248@lums.edu.pk)

Sarmad Chandio(21100223@lums.edu.pk)
