Collaborative Filtering Using Singular Value Decomposition (SVD): Amazon Product Review

Blessing Magabane
7 min readAug 26, 2020

--

What should I recommend you?

https://1000logos.net/amazon-logo/

What are recommender systems?

Recommender systems or engines are algorithms that are aimed at suggesting relevant items to users. Recommender engines have seen rapid growth over the last decade this is due to FAANG (Facebook, Apple, Amazon, Netflix, Google) and other tech giants. They invested heavily in the research and development of recommender systems with the objective to retain and sell more product to customers. Recommendation systems are a sub class of information filtering. In the age of big data, filtering the right information to customers is key and can increase profit margins

Recommender systems used to be primitive, the mechanism to provide suggestions was solely based on item proximity and similarity. However, with the inception of artificial intelligence and machine learning recommender systems have become more intelligent and deliver high satisfactions among users. There are different types of recommender systems and the main ones are the content-based methods and collaborative filtering methods. Collaborative filtering methods are based on past interaction between users and items. An example of that would be YouTube, the site suggests videos based on the previous views and likes.

On the other hand, we have content-based methods, they make use of user’s features such the age, genre and gender. These features are intrinsic to the user and not from interactions. Content-based recommender systems use a model to make suggestions. Another type of recommender system that is worth mentioning is the hybrid recommender systems they make use of both content and collaborative methods to build a more powerful recommender system.

The main aim of this article is to demonstrate how collaborative filtering recommender systems work. The idea is to recommend music to users based on the rating of the song on an ecommerce platform.

Dataset:

To build a recommender system data is required, data from Amazon product reviews will be used. Amazon is an American ecommerce giant; the ecommerce platform has vast amounts of data about products they are selling. It for this reason that data from Amazon we will be used. The datasets can be found on the following link,

The datasets found on the Amazon repository are fairly large and placed in different categories. For the purpose of this article we will focus only on digital music. We choose music since it easy to understand and relatable to most people.

The datasets contain ratings, text, helpfulness votes, product metadata (description, category information, price, brand, and image features). The format for the reviewer’s data is as follows,

  • reviewerID — ID of the reviewer
  • asin — ID of the product
  • reviewerName — name of the reviewer
  • vote — helpful votes of the review
  • style — a disctionary of the product metadata, e.g., “Format” is “Hardcover”
  • reviewText — text of the review
  • overall — rating of the product
  • summary — summary of the review
  • unixReviewTime — time of the review (unix time)
  • reviewTime — time of the review (raw)
  • image — images that users post after they have received the product

While the product metadata is given by,

  • asin — ID of the product
  • title — name of the product
  • feature — bullet-point format features of the product
  • description — description of the product
  • price — price in US dollars (at time of crawl)
  • image — url of the product image
  • related — related products (also bought, also viewed, bought together, buy after viewing)
  • salesRank — sales rank information
  • brand — brand name
  • categories — list of categories the product belongs to
  • tech1 — the first technical detail table of the product
  • tech2 — the second technical detail table of the product
  • similar — similar product table

Data Processing:

Data Preparation:

The data is stored in json format, in order to work with it we need to convert it to a data frame then feed it to a model. Json formats are fairly complicated to work with but once converted into readable text, they are simple to manipulate.

To load the data into our notebook, we define an empty array to store the json file as shown below,

Once the data is loaded into the array, we need to view the information, we start with viewing the headers as shown below,

The screen print above shows important information such as the rating, reviewerID and the sentiment from the reviewer.

To convert the json data format we simply use the DataFrame() function from pandas library. The process is shown below,

Similarly, the information about the product metadata can be obtained in similar fashion,

Once the reviewer data and the product meta are loaded and converted to a data frame we join the two. We join them using the asin column it is a unique identifier, we use the merge() function to perform the join as shown below,

Data Preprocessing:

Once the data has been prepared, we move on to processing it. Most of the dataset has missing values or the date is not in the correct format. The data processing step is there to remedy this. We start with indexing the date as shown below,

We drop some of the columns that we don’t need such the category, style and fit. They are mostly empty and contain no data.

Then there are columns that contain data but some of the entries have nulls, we fill those nulls with zeros such a column is the vote,

Once the data is processed, the next is to divide it into training, validating and testing.

Matrix factorisation implementation:

The diagram below shows the matrix implementation,

Modeling:

Since we are only focusing on collaborative filtering it is important to look at the different approaches within this category, we have 3 approaches as stated below,

  • User to user
  • Item to item
  • Matrix factorisation

Collaborative filtering methods can be further subdivided into two categories. The first method is the no model, it uses pure proximity association and uses algorithms such as the nearest neighbor’s search. The other method is the model based were an underlying model is used to make recommendations. In this article we are not going to perform any baseline of the model or investigate the bias in the data. However, we will remove datasets that might cause “cold start”. Cold start is the deficiency of user to item interaction information which might lead to incorrect inference and bias in the suggestions.

The model will use singular value decomposition or simply SVD. We are using matrix decomposition techniques to make our recommendations. The screen print below demonstrates the process of separating the data into training, validating and testing.

Running the model.

Fitting the model,

There are different ways to measure the accuracy the of the model. In this article the root mean square error was used and the result was 1.0076. Predictions, from the model are not the best the is high variance and latency in the matrix factorisation.

Conclusion:

Despite all the advanced techniques recommender systems utilise they have a number of drawbacks. For instance, in collaborative methods it is impossible to recommend anything to new users. To work around this setback, one can randomly recommend an item to a new user. The best way will be recommending popular items. It can be observed from the modeling section that the singular value decomposition procedure relied heavily on the interaction between the user and item. Without that information the factorisation is meaningless since we will be dealing with an empty matrix.

To enhance the performance of the SVD we can use hyper parametrization and a sparser matrix to increase computational power thereby optimising performance. Depending on the application of the recommender system a number of architectures can be used.

Next time when you’re streaming content on your favourite platform and it recommends you something that closely related to your preference or taste just know that an algorithm did that and not magic but data science.

The notebook I used to model the problem can be found on the following link,

Reference:

Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems/Mengting Wan, Julian McAuley/International Conference on Data Mining (ICDM), 2016

Addressing complex and subjective product-related queries with customer reviews/Julian McAuley, Alex Yang/World Wide Web (WWW), 2016

You can follow and contact me on the following platforms,

Twitter :@blessing3ke

LinkedIn : https://www.linkedin.com/in/blessing-magabane

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

--

--

Blessing Magabane
Blessing Magabane

Written by Blessing Magabane

A full stack Data Scientist with experience in data engineering and business intelligence.

No responses yet

Write a response