Welcome to Bon Yinzers

Bringing you the secret to attracting Pittsburgers

About us


Who we are

We are a group of four information technology students, two undergraduates and two master's students, from Carnegie Mellon University. Our career paths have led us all to the Data Pipeline master's course offered by the Human Computer Interaction Master Program. In this course we have learned many steps of the data pipeline, all leading to the final project you have before you.

What's our Purpose

For our final project, which had to encompass the whole data pipeline, we chose the Yelp Dataset Challenge. We were attracted to this dataset by recent publicity about how the city of Pittsburgh is considered one of the top food cities. Our task in this project is to explore the data and draw useful insights from it. To that end, we are taking on the challenge of what it takes to beat Yelp.

Our Data

Sources: Yelp Data Challenge & Yelp API

1,989

Pittsburgh Restaurants

54

Neighborhoods

192

Categories

79

Attributes

Understanding Yelp


Exploring Top-Ranked Restaurants

We first wanted to check the validity of Yelp's own ranking based on the number of reviews and review stars.

To distinguish the top restaurants from the weaker ones, we needed a defining feature. On further exploration, we found that star rating alone is not the attribute that best separates restaurants, since the majority of the top restaurants have 4 stars. So we decided to use Yelp's own, rather opaque, ranking algorithm to gain more insight into the top restaurants. To do this, we scraped the Yelp search results for "Best Restaurants in Pittsburgh, PA". The results covered only the top 1,000 restaurants, which is more than half of the total; we gave all the remaining restaurants the same ranking. Because a list of over a thousand individual ranks is not very descriptive on its own, we grouped the ranks into bins of size 10, reducing the top 1,000 restaurants to 100 bins.

Once we had matched every bin to its corresponding restaurants, we explored, and invite you to explore, the relationship between a restaurant's ranking and its average rating and number of reviews. The Tableau view lets you select bins to see the correlations. For example, selecting groups 0 to 5 shows that they all sit in the top right. As you widen the selection or move down the ranking groups, both the average number of reviews and the average rating drop.
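The binning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: bins are numbered from 0 (to match "groups 0 to 5"), unranked restaurants share one extra bin numbered 100, and the field names `rank`, `stars`, and `review_count` as well as the sample rows are hypothetical, not taken from our actual data files.

```python
from collections import defaultdict

BIN_SIZE = 10
UNRANKED_BIN = 100  # assumption: one shared bin placed after bins 0..99


def rank_to_bin(rank):
    """Map a 1-based rank to a zero-based bin; None means the restaurant was not ranked."""
    if rank is None:
        return UNRANKED_BIN
    return (rank - 1) // BIN_SIZE


def summarize_bins(restaurants):
    """Return the count, average stars, and average review count for each bin."""
    grouped = defaultdict(list)
    for r in restaurants:
        grouped[rank_to_bin(r["rank"])].append(r)
    summary = {}
    for bin_id, members in sorted(grouped.items()):
        n = len(members)
        summary[bin_id] = {
            "count": n,
            "avg_stars": sum(m["stars"] for m in members) / n,
            "avg_reviews": sum(m["review_count"] for m in members) / n,
        }
    return summary


# Made-up rows for illustration only.
sample = [
    {"name": "A", "rank": 1, "stars": 4.5, "review_count": 900},
    {"name": "B", "rank": 10, "stars": 4.0, "review_count": 700},
    {"name": "C", "rank": 11, "stars": 4.0, "review_count": 300},
    {"name": "D", "rank": None, "stars": 3.0, "review_count": 12},
]
bin_summary = summarize_bins(sample)
```

With this numbering, ranks 1 to 10 fall in bin 0 and ranks 991 to 1,000 fall in bin 99, so the per-bin averages can be plotted directly against the bin number.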


Exploring Restaurant Locations

Then we wanted to explore if location had any impact on ranking of the restaurants.

Lastly, we wanted to see whether the location of the restaurants had anything to show. Using a map of Pittsburgh together with the rankings, you can explore where the top restaurants are and where the lower-ranked restaurants are.

Feature Analysis


Exploring Categories and Attributes

After understanding Yelp, we dove deep into the restaurant attributes and analyzed which attributes benefitted each category the most.

Yelp gives restaurants the option to add many categories, essentially making them more unique. One question that came to mind was whether there is any relationship between the categories restaurants have and their ranking. Because there are so many categories, we used a word cloud to show the most popular categories among the restaurants in the selected bins. For example, if you select bins 0 to 1, you can tell that they all share the "American (New)" category. This lets you, as a user, see which categories are popular, or what may be missing from the top places. The right column shows which attributes are best to have in order to achieve a higher ranking in each category.
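The word cloud is driven by simple category counts within the selected bins. A minimal sketch of that counting step, assuming each restaurant record carries a hypothetical `bin` number and a `categories` list (the sample rows are invented for illustration):

```python
from collections import Counter


def category_counts(restaurants, first_bin, last_bin):
    """Count how often each category appears among restaurants in the selected bins."""
    counts = Counter()
    for r in restaurants:
        if first_bin <= r["bin"] <= last_bin:
            counts.update(r["categories"])
    return counts


# Made-up rows for illustration only.
sample = [
    {"name": "A", "bin": 0, "categories": ["American (New)", "Bars"]},
    {"name": "B", "bin": 1, "categories": ["American (New)", "Italian"]},
    {"name": "C", "bin": 7, "categories": ["Pizza"]},
]
cloud_counts = category_counts(sample, 0, 1)
top_category = cloud_counts.most_common(1)
```

The resulting counts can be used as word weights, so the most common category in the selection is drawn largest.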

In addition to the word cloud, we needed to extract the most important attributes of a restaurant, the ones that determine its success and ranking. Using a Gradient Boosting Machine, we extracted the top 8 features, so you can see how they do within the selected group.
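To give a feel for how a boosted model can rank attributes, here is a toy sketch in plain Python. It is not the model we trained: it assumes yes/no attributes encoded as 0/1, uses one-split trees (stumps) with squared-error loss, and scores each attribute by the total error reduction it contributes across boosting rounds. The function names, the synthetic data, and the target (a lower value meaning a better ranking bin) are all hypothetical.

```python
from itertools import product


def fit_stump(rows, residuals):
    """Pick the 0/1 feature whose split most reduces squared error on the residuals."""
    base = sum(r * r for r in residuals)
    best = None  # (gain, feature index, mean residual when 0, mean residual when 1)
    for j in range(len(rows[0])):
        groups = {0: [], 1: []}
        for x, r in zip(rows, residuals):
            groups[x[j]].append(r)
        if not groups[0] or not groups[1]:
            continue  # feature is constant, so it cannot split the data
        means = {v: sum(g) / len(g) for v, g in groups.items()}
        sse = sum((r - means[x[j]]) ** 2 for x, r in zip(rows, residuals))
        gain = base - sse
        if best is None or gain > best[0]:
            best = (gain, j, means[0], means[1])
    return best


def boosted_importances(rows, targets, names, rounds=20, learning_rate=0.5):
    """Boost stumps on the residuals and return features sorted by normalized gain."""
    pred = [sum(targets) / len(targets)] * len(targets)
    importance = {name: 0.0 for name in names}
    for _ in range(rounds):
        residuals = [t - p for t, p in zip(targets, pred)]
        best = fit_stump(rows, residuals)
        if best is None or best[0] <= 1e-12:
            break  # nothing left to explain
        gain, j, mean0, mean1 = best
        importance[names[j]] += gain
        pred = [p + learning_rate * (mean1 if x[j] else mean0)
                for p, x in zip(pred, rows)]
    total = sum(importance.values()) or 1.0
    return sorted(((n, v / total) for n, v in importance.items()),
                  key=lambda item: item[1], reverse=True)


# Synthetic data: the target depends strongly on Reservations, weakly on Wifi,
# and not at all on Take Out.
names = ["Take Out", "Reservations", "Wifi"]
rows = list(product([0, 1], repeat=3))
targets = [50 - 30 * reservations - 5 * wifi for _, reservations, wifi in rows]
ranked = boosted_importances(rows, targets, names)
top_features = ranked[:8]  # keep at most the top 8, as on this page
```

On this synthetic data the ranking comes out as Reservations, then Wifi, then Take Out, which matches how the target was constructed. A production model would use deeper trees and a tested library rather than this hand-rolled loop.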


Exploring Categories and Checkin Days

We then used a Sankey chart to explore which days of the week people tend to check in to each category of restaurant.

Along with restaurant and review information, the Yelp dataset also includes check-in information. A check-in records the time at which a user manually checks in via their phone while at a particular business. This information could hold valuable insight, so our team decided that the best way to explore when and where people check in was a Sankey chart built with D3. We looked at the 15 most popular categories and the days of the week on which Yelp reviewers check in most frequently.

It is no surprise that Friday had the most check-ins and that Sunday and Monday had the fewest. One surprising finding was that Thursday had more check-ins than Saturday.
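A Sankey chart needs its input as a list of nodes and a list of weighted links between them. The sketch below shows one way to build that structure for the category-to-day view, assuming the check-ins have already been flattened into hypothetical `(category, day, count)` tuples; the sample values are invented. The `nodes`/`links` shape with index-based `source` and `target` is the form commonly fed to D3 Sankey layouts.

```python
from collections import Counter

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def sankey_data(checkins, top_n=15):
    """Build nodes and links from (category, day, count) tuples for the top categories."""
    totals = Counter()
    flows = Counter()
    for category, day, count in checkins:
        totals[category] += count
        flows[(category, day)] += count
    top = [category for category, _ in totals.most_common(top_n)]
    top_set = set(top)
    nodes = [{"name": name} for name in top + DAYS]
    index = {node["name"]: i for i, node in enumerate(nodes)}
    links = [
        {"source": index[category], "target": index[day], "value": value}
        for (category, day), value in sorted(flows.items())
        if category in top_set
    ]
    return {"nodes": nodes, "links": links}


# Made-up rows for illustration only.
sample = [
    ("Bars", "Friday", 120),
    ("Bars", "Thursday", 90),
    ("Pizza", "Friday", 60),
    ("Bars", "Friday", 30),
    ("Cafes", "Monday", 5),
]
chart = sankey_data(sample, top_n=2)
```

The width of each link in the chart is proportional to its `value`, so the busiest category and day pairs stand out immediately.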


Exploring Categories and Checkin Times

We also used a Sankey chart to explore which times of day people tend to check in to each category of restaurant.

The Sankey charts below explore how check-ins connect the top 15 categories in Pittsburgh with the time of day. There are some interesting connections to be made. For example, we expected the most popular check-in time to be either lunchtime or dinnertime, but it turned out to be between 4:00 pm and 5:00 pm. Another insight is that Thursday and Friday are the most popular check-in days. These insights are not only interesting; they can also help any restaurant owner see when Yelp users are most active.

The day-of-week and time-of-day data can be combined so that users can find out when a restaurant needs to be at its best in order to achieve a higher rating or ranking.
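One simple way to combine the two views is to look for the busiest day-and-hour slot per category. The sketch below assumes check-ins flattened into hypothetical `(category, day, hour, count)` tuples, with `hour` on a 24-hour clock; the sample values are invented.

```python
from collections import defaultdict


def busiest_slots(checkins):
    """Return {category: (day, hour, total)} for the slot with the most check-ins."""
    totals = defaultdict(lambda: defaultdict(int))
    for category, day, hour, count in checkins:
        totals[category][(day, hour)] += count
    result = {}
    for category, slots in totals.items():
        (day, hour), total = max(slots.items(), key=lambda item: item[1])
        result[category] = (day, hour, total)
    return result


# Made-up rows for illustration only.
sample = [
    ("Bars", "Friday", 16, 40),
    ("Bars", "Friday", 16, 25),
    ("Bars", "Saturday", 21, 50),
    ("Pizza", "Thursday", 12, 30),
]
peak = busiest_slots(sample)
```

A restaurant owner could read the result for their own category as the slot in which service matters most for active Yelp users.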

What we Learned


Web design

Throughout the project, our whole team learned a lot about web design, web development, and how to present information and data to users.

Data

We learned a lot about data visualization (D3, Sankey, Tableau), data scraping, and creating a data narrative.

Preprocess Data

We learned a lot of new methods for data cleaning, dimensionality reduction, and other preprocessing steps.

Machine Learning

We learned a lot about machine learning, as well as gradient descent, throughout this project to turn messy raw data into insightful data applications.

Categories

Our findings from the data show that restaurant categories don't really affect ranking, mainly because the ranking was based on the search query "Best Restaurants".

Top Types

We also learned that the top ranked restaurants are not all the same type and category, since variety is always good!

Checkins

Surprisingly, we found that the most popular check-in days are Thursday and Friday, and that 3-5 PM is the most popular time.

Reviews and Features

We learned from this project that the number of reviews, review stars, and certain features/attributes affect ranking the most.

Our Approach/Methodology


4 C's


Meet our Team


...

Richard Huang

Information Systems

...

Sebastian Guerrero

Information Systems

...

Ankit Gupta

MISM

...

Li (Sophia) He

Public Policy

Predict Expected Ranking


Fill out the features of your restaurant

Take Out

Reservations

Catering

Wifi

Drive-Thru

Alcohol

Neighborhoods

Attire

Price Range

Noise Level