We are a group of four information technology students, two undergraduates and two master's students, from Carnegie Mellon University. Our career paths led us all to the Data Pipeline master's course offered by the Human-Computer Interaction master's program. In this course we learned many steps of the data pipeline, all leading to the final project you have before you.
For our final project, which had to encompass the whole data pipeline, we chose the Yelp Dataset Challenge. We were attracted to this dataset by recent publicity about how the city of Pittsburgh is considered one of the top food cities. Our task in this project is to explore the data and draw useful insights from it. To that end, we took on the challenge of figuring out what it takes to beat Yelp.
Since we wanted to find out what differentiates the top restaurants from the weaker ones, we needed a defining feature. On further exploration, we found that star rating is not the attribute that best distinguishes restaurants, since the majority of the top restaurants have 4 stars. To solve this problem, we decided to use Yelp's own opaque ranking algorithm to gain more insight into the top restaurants. We scraped the Yelp website for "Best Restaurants in Pittsburgh, PA" to determine which restaurants rank highest. We only received the top 1,000 restaurants, which is more than half of the total, and we gave all the remaining restaurants the same ranking. Because a list of a thousand rankings is not very descriptive in itself, we grouped the rankings into bins of size 10, reducing the top 1,000 restaurants to 100 bins. Once we had matched each bin to its restaurants, we explored, and invite you to explore, the relationship between ranking, average rating, and number of reviews. The Tableau visualization lets you select bins to see the correlations. For example, if you select groups 0 to 5, you see that they all sit in the top right. As you widen the selection or move down the ranking groups, both the average number of reviews and the average rating drop.
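The binning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rank_to_bin`, the 1-based ranks, the 0-based bin numbers, and the single overflow bin for unranked restaurants are our own hypothetical choices, and the sample restaurants are made up.

```python
def rank_to_bin(rank, bin_size=10, max_rank=1000):
    """Map a 1-based search rank to a 0-based bin index.

    Restaurants with no scraped rank (None), or ranked beyond max_rank,
    all share one overflow bin, so they effectively get the same ranking.
    """
    overflow_bin = max_rank // bin_size
    if rank is None or rank > max_rank:
        return overflow_bin
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return (rank - 1) // bin_size


# Hypothetical sample: (restaurant name, scraped rank or None if unranked)
ranked = [("A", 1), ("B", 10), ("C", 11), ("D", 1000), ("E", None)]
bins = {name: rank_to_bin(rank) for name, rank in ranked}
print(bins)
```

With bins of size 10, ranks 1 to 10 fall in bin 0, ranks 11 to 20 in bin 1, and so on up to bin 99, which matches the 0-based group numbers used in the visualization.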
Lastly, we wanted to see whether the location of the restaurants reveals anything. Using a map of Pittsburgh together with the rankings, you can explore where the top restaurants are and where the lower-ranked restaurants are.
Yelp gives restaurants the option to add many categories, which helps make each one more distinctive. A question that came to mind was whether there is any relationship between the categories restaurants have and their ranking. Since there were so many categories, we decided to use a word cloud that maps the most popular categories among the restaurants in the selected bins. For example, if you select bins 0 to 1, you can tell that they all share the "American (New)" category. This lets you as a user see which categories are popular, or spot what is missing from the top places. The right column tells users which attributes are best to have to achieve a higher ranking in each category.
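The counts behind such a word cloud can be sketched as follows. The record layout (`bin`, `categories`), the function name `category_counts`, and the sample restaurants are hypothetical; the idea is simply to tally categories over the restaurants in the selected bins, so that the word cloud can size each category by its count.

```python
from collections import Counter


def category_counts(restaurants, selected_bins):
    """Count how often each category appears among restaurants in the selected bins."""
    counts = Counter()
    for restaurant in restaurants:
        if restaurant["bin"] in selected_bins:
            counts.update(restaurant["categories"])
    return counts


# Hypothetical sample records; real data would come from the Yelp dataset.
restaurants = [
    {"name": "A", "bin": 0, "categories": ["American (New)", "Bars"]},
    {"name": "B", "bin": 1, "categories": ["American (New)", "Italian"]},
    {"name": "C", "bin": 7, "categories": ["Pizza"]},
]

top = category_counts(restaurants, selected_bins={0, 1})
print(top.most_common(3))
```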
In addition to the word cloud, we had to extract the attributes of a restaurant that matter most for its success and ranking. Using a Gradient Boosting Machine, we extracted the top 8 features, so you can see how restaurants in the selected group do on each of them.
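To give a feel for how a gradient boosting model can rank features, here is a minimal from-scratch sketch. It is not the implementation used in the project: it assumes squared-error loss, uses one-split decision stumps as the weak learners, and scores each feature by the total error reduction of the splits made on it. The feature names and the synthetic data are hypothetical; in practice a library implementation would be the usual choice.

```python
import random


def fit_stump(X, residuals):
    """Find the (feature, threshold) split that most reduces squared error."""
    n = len(X)
    total = sum(residuals)
    best = None  # (gain, feature, threshold, left_mean, right_mean)
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(n - 1):
            i, nxt = order[k], order[k + 1]
            left_sum += residuals[i]
            if X[i][j] == X[nxt][j]:
                continue  # cannot split between equal values
            n_left, n_right = k + 1, n - k - 1
            right_sum = total - left_sum
            # Reduction in the sum of squared errors achieved by this split
            gain = left_sum**2 / n_left + right_sum**2 / n_right - total**2 / n
            if best is None or gain > best[0]:
                threshold = (X[i][j] + X[nxt][j]) / 2
                best = (gain, j, threshold, left_sum / n_left, right_sum / n_right)
    return best


def boosted_importances(X, y, n_rounds=50, learning_rate=0.1):
    """Boost stumps on the residuals and return normalized feature importances."""
    n = len(y)
    pred = [sum(y) / n] * n  # start from the mean prediction
    importance = [0.0] * len(X[0])
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        gain, j, threshold, left_mean, right_mean = stump
        importance[j] += gain
        for i in range(n):
            step = left_mean if X[i][j] <= threshold else right_mean
            pred[i] += learning_rate * step
    total = sum(importance)
    return [v / total for v in importance] if total > 0 else importance


# Synthetic, hypothetical data: the target depends on the first two features only.
random.seed(0)
feature_names = ["review_count", "stars", "noise"]
X = [[random.uniform(0, 500), random.uniform(1, 5), random.random()] for _ in range(200)]
y = [0.01 * row[0] + 1.0 * row[1] for row in X]

importances = boosted_importances(X, y)
top_k = sorted(zip(feature_names, importances), key=lambda t: t[1], reverse=True)[:2]
print(top_k)
```

Taking the first 8 entries of the sorted list instead of the first 2 would give a top-8 selection like the one described above.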
Along with restaurant and review information, the Yelp dataset also includes check-in information. A check-in records the time at which a user manually checks in on their phone while at a particular business. This information could hold valuable insight, so our team decided that the best way to explore when and where people check in was a Sankey chart built with D3. We explored the top 15 most popular categories and the days of the week on which Yelp reviewers check in most frequently.
It is no surprise that Friday had the most check-ins and Sunday and Monday had the fewest. One surprising finding was that Thursday actually had more check-ins than Saturday.
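A Sankey chart is driven by a list of nodes and a list of weighted links between them. The sketch below shows one way to aggregate check-ins into that shape before handing the result to the charting code as JSON. The input tuples, the function name `sankey_data`, and the sample numbers are hypothetical, and the output layout (links that refer to nodes by position) is an assumption about what the chart expects.

```python
import json
from collections import Counter

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def sankey_data(checkins):
    """Aggregate (category, day, count) rows into Sankey nodes and links."""
    totals = Counter()
    for category, day, count in checkins:
        totals[(category, day)] += count

    categories = sorted({category for category, _ in totals})
    seen_days = {day for _, day in totals}
    days = [day for day in DAYS if day in seen_days]  # keep calendar order

    names = categories + days
    index = {name: i for i, name in enumerate(names)}
    nodes = [{"name": name} for name in names]
    links = [
        {"source": index[category], "target": index[day], "value": value}
        for (category, day), value in sorted(totals.items())
    ]
    return {"nodes": nodes, "links": links}


# Hypothetical sample rows; real counts would come from the check-in data.
checkins = [
    ("Bars", "Friday", 12),
    ("Bars", "Thursday", 9),
    ("Pizza", "Friday", 5),
    ("Bars", "Friday", 3),
]

data = sankey_data(checkins)
print(json.dumps(data, indent=2))
```

Replacing the day of the week with an hour-of-day bucket in the input rows would give the data for a category-to-time-of-day chart in the same way.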
In the Sankey charts below, we explore the connections between check-ins in the top 15 categories in Pittsburgh and the day of the week. Below that is another Sankey chart that connects the top 15 categories and the time of day. There are some interesting connections to be made. For example, we expected the most popular check-in time to be either lunchtime or dinnertime, but as it turns out it was between 4:00 pm and 5:00 pm. Another insight is that Thursday and Friday are the most popular check-in days. These insights are not only interesting but can also help any restaurant owner see when Yelp users are most active.
Day-of-week and time-of-day data can be combined to show when restaurants should be at their best in order to achieve a higher rating or ranking.
Our whole team learned a lot about web design and web development and how to present information and data to users throughout the whole project.
We learned a lot about data visualization (D3, Sankey, Tableau), data scraping, and creating a data narrative.
We learned a lot of new methods for data cleaning, dimensionality reduction, and other preprocessing steps.
We learned a lot about machine learning as well as gradient descent throughout this project, which helped us turn messy raw data into insightful data applications.
Our findings from the data show that restaurant categories don't really affect ranking, mainly because the ranking was based on the search query "Best Restaurants".
We also learned that the top ranked restaurants are not all the same type and category, since variety is always good!
Surprisingly, we found that the most popular check-in days are Thursday and Friday, and that 3-5 PM is the most popular time.
We learned from this project that the number of reviews, review stars, and certain features/attributes affect ranking the most.