Are You Asking The Right Questions - Towards Data Science

4/19/2019 Are you asking the right questions – Towards Data Science
Are you asking the right questions?

Thiagarajan Nagarajan
Apr 3 · 5 min read
In one of the episodes of “Brain Games with Jason Silva”, witnesses of a
car crash are asked to estimate the speed of the car involved. Some
estimated that the speed was about 10–20 mph, while others guessed it
was about 40–50 mph. Why do you think the estimates of the
two groups were so far apart?
Maybe the groups were of different ages, or maybe they viewed it from
different distances. What do you think?
Turns out the difference in answers was not caused by the ability or
inability of the two groups to estimate speed, but by the questioner. The
question posed to the group that estimated 10–20 mph was “How fast
do you think the car was going when it bumped into the other car?”
and the one to the group that estimated 40–50 mph was “How fast do
you think the car was going when it smashed into the other car?”
By the way, the actual speed was 20 mph.
Isn’t it awesome (and kind of scary) that our answers depend so much
on subtle differences in the tone of the question, rather than on the
actual facts? This isn’t a new finding. You have probably heard that
your child is more likely to eat veggies if you ask him/her “Do you want
two spoons of veggies or three?” rather than “Do you want some
veggies?”, and you have probably experienced in brainstorming
sessions and meetings that the same question, framed in different ways,
elicits different responses.
As Dale Carnegie said, “We are creatures of emotion, not logic.”
But what about the creature of logic that answers over 3.5 billion
questions a day, a.k.a. Google? Do you think it is immune to the tone of
the question?
https://towardsdatascience.com/are-you-asking-the-right-questions-599b85f9703c
If you are reading this, you probably know that asking Google “Why
chicken is bad for you?” and “Why chicken is good for you?” is going to
give you two very different sets of results with little overlap, at
least on the first page. These are questions asking about a
specific facet of the subject. You ask what is bad, and Google tries to
answer exactly that. If you put the same questions to experts from
different camps, say Dr. Peter Attia and Dr. Michael Greger, they are
still bound to address the specific facet that was asked about, irrespective
of whether they think chicken is good or bad on the whole.
What about the questions that have different tones, but stem from the
same underlying question? For example, the questions, “Is chicken
healthy?”, and “Is chicken unhealthy?” stem from the same
underlying fact that you are not sure if it’s healthy or not. You are
asking if it is healthy or unhealthy on the whole.
Given the keyword difference, it is a no-brainer that the search results
won’t be identical. But to what extent do they differ? We are about to
find out.
TL;DR: We will Google the above-mentioned questions, use Beautiful
Soup to scrape the content of the top results that most of us are likely to
click, build a summarizer using Python's NLTK to create a four-line
summary of each of these top links, and look at what we end up with.
Long version: I used SEOquake to export the URLs in the search results of
each question to a CSV file and then read them into a pandas
DataFrame. Google search results vary based on your location, past
search history, etc., so you will most likely get different results.

    # importing relevant libraries
    import os
    import pandas as pd
    import glob
    import bs4 as bs
    from urllib.request import Request, urlopen
    import re

    path = r'C:\Users\thiag\NLP'
    # creating list of file names of all exported csv files
    files = [os.path.basename(x) for x in glob.glob(path +  # rest of this line is cut off in the source
    list_of_df = []
    for file in files:
        # looping through csv files to create individual
        # dataframes for each list of search results
        # (join with path so the read does not depend on the working directory)
        df = pd.read_csv(os.path.join(path, file), header=None, names=[file])
        # keep each URL only once
        df = pd.DataFrame(df[file].unique(), columns=[file])

So, we now have a DataFrame with the Google queries as column headers
and the corresponding top URLs as column values.
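
The loading step can be sketched end to end as follows. This is a minimal illustration, not the author's exact code: the function name load_search_results, the example file names, and the example.com URLs are all made up, and each export is assumed to hold one URL per row with no header.

```python
import glob
import os
import tempfile

import pandas as pd


def load_search_results(folder):
    """Read every CSV in `folder` into one DataFrame, one column per file."""
    frames = []
    for csv_path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        name = os.path.basename(csv_path)
        # Assumption: one URL per row, no header row.
        urls = pd.read_csv(csv_path, header=None, names=[name])[name]
        # Drop repeated URLs while keeping the original ranking order.
        unique_urls = urls.drop_duplicates().reset_index(drop=True)
        frames.append(pd.DataFrame({name: unique_urls}))
    # Columns of different lengths are padded with NaN.
    return pd.concat(frames, axis=1)


# Tiny demonstration with made-up URLs written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "is chicken healthy.csv"), "w") as f:
        f.write("https://example.com/a\nhttps://example.com/b\n"
                "https://example.com/a\n")
    with open(os.path.join(tmp, "is chicken unhealthy.csv"), "w") as f:
        f.write("https://example.com/c\n")
    results = load_search_results(tmp)

print(results)
```

Naming each column after its file keeps the query visible next to its URLs, which is what the later loop over columns relies on.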
Scraping: I looped through these URLs, used Beautiful Soup to parse the
HTML and extract everything on each of these websites that was
enclosed in paragraph tags <p>, and then used a bunch of regular
expressions to weed out unwanted stuff. I made this scraper
general enough to work on all of the websites and extract content that is
just good enough to feed into our summarizer. That way we can scrape
all websites using one for loop.
Extracting every bit of the exact content would require us to dig into
class ids of each of these HTML pages. For our purposes, this isn’t
required, and our generalized scraper is sufficient to do a good job.

    # Creating an empty DataFrame to store the summaries
    ind = Cleaned_up_df.index
    cols = Cleaned_up_df.columns
    Summary_df_4 = pd.DataFrame(index=ind, columns=cols)
    for col in cols:
        for i in list(ind):  # looping through each cell of the DataFrame
            req = Request(Cleaned_up_df[col][i], headers={  # rest of this line is cut off in the source
            article = urlopen(req).read()
            parsed_article = bs.BeautifulSoup(article, 'lxml')
            paragraphs = parsed_article.find_all('p')  # pulling all <p> elements
            article_text = ''
            for p in paragraphs:
                # concatenating the text of every paragraph
                article_text = article_text + ". " + p.text

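
To show the idea of "grab every <p>, then clean with regular expressions" without hitting the network, here is a small offline sketch. It is an illustration under assumptions, not the author's scraper: it swaps in the standard library's html.parser for Beautiful Soup so it runs without extra installs, the names ParagraphCollector and extract_article_text are hypothetical, and the two regular expressions are only examples, since the ones actually used are not shown in the article.

```python
import re
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p> elements, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._chunks = []      # pieces of the paragraph currently being read
        self._inside = False   # True while we are between <p> and </p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()       # an unclosed <p> ends when the next one starts
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def flush(self):
        text = "".join(self._chunks).strip()
        if text:
            self.paragraphs.append(text)
        self._chunks = []


def extract_article_text(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector.flush()                        # keep a trailing unclosed <p>
    text = " ".join(collector.paragraphs)
    text = re.sub(r"\[\d+\]", " ", text)     # drop reference markers like [12]
    text = re.sub(r"\s+", " ", text)         # collapse runs of whitespace
    return text.strip()


sample_html = """<html><body><nav>Home | About</nav>
<p>Chicken is a lean   source of protein.[1]</p>
<div>Advertisement</div>
<p>It is <b>widely</b> eaten.</p></body></html>"""

cleaned = extract_article_text(sample_html)
print(cleaned)
```

Navigation links and ad containers fall outside <p> tags, which is why this generic approach is usually good enough for feeding a summarizer.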
Summarizer:
1. Tokenize the text of each website into a list of individual sentences
using a sentence tokenizer (nltk.sent_tokenize), and into a list of
individual words using a word tokenizer (nltk.word_tokenize).
2. Count the number of occurrences of each word in the given
text, ignoring stopwords, using a simple for loop with an if check.
3. Calculate the frequency of each word (number of occurrences of the
word divided by the number of occurrences of the most frequent
word).
4. Calculate the score of each sentence by adding up the frequencies of
the words in the sentence, and store the scores in a dictionary.
5. Using a heap, pick out the 4 sentences that have the largest scores. This
is our four-line summary. Store these summaries in a DataFrame and
export them as CSV.

    import nltk
    sentence_list = nltk.sent_tokenize(article_text)
    stopwords = nltk.corpus.stopwords.words('english')
    word_frequencies = {}
    for word in nltk.word_tokenize(formatted_artic  # rest of this line is cut off in the source
        if word not in stopwords:
            if word not in word_frequencies.keys():
                # counting occurrences of each non-stopword
                word_frequencies[word] = 1
            else:
                word_frequencies[word] += 1
    max_freq = max(word_frequencies.values())
    for word in word_frequencies.keys():
        # dividing by the count of the most frequent word (step 3)
        word_frequencies[word] = word_frequencies[word] / max_freq
    sentence_scores = {}
    for sent in sentence_list:
        for word in nltk.word_tokenize(sent.lower()):
            if word in word_frequencies.keys():
                # only sentences shorter than 30 words are scored
                if len(sent.split(' ')) < 30:
                    if sent not in sentence_scores:
                        # the listing is cut off here in the source

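
The five summarizer steps can be sketched in a few lines. This is a hedged, self-contained illustration rather than the author's code: to run without NLTK downloads it uses simple regular expressions in place of nltk.sent_tokenize and nltk.word_tokenize and a tiny made-up stopword set in place of NLTK's English list, and the names summarize, n_sentences, and max_words are hypothetical.

```python
import heapq
import re

# A tiny illustrative stopword set; NLTK's English list is much longer.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to"}


def tokenize_words(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def summarize(text, n_sentences=4, max_words=30):
    # Step 1: split into sentences and words (regex stand-ins for NLTK).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = tokenize_words(text)

    # Step 2: count occurrences of every non-stopword.
    counts = {}
    for word in words:
        if word not in STOPWORDS:
            counts[word] = counts.get(word, 0) + 1
    if not counts:
        return []

    # Step 3: divide each count by the count of the most frequent word.
    max_count = max(counts.values())
    weights = {word: count / max_count for word, count in counts.items()}

    # Step 4: score each reasonably short sentence by summing its word weights.
    scores = {}
    for sentence in sentences:
        if len(sentence.split()) < max_words:
            scores[sentence] = sum(
                weights.get(word, 0.0) for word in tokenize_words(sentence)
            )

    # Step 5: use a heap to pick the highest-scoring sentences.
    return heapq.nlargest(n_sentences, scores, key=scores.get)


text = (
    "Chicken is a source of protein. "
    "Protein helps build muscle. "
    "The weather was nice on Tuesday. "
    "Lean chicken protein is popular."
)
summary = summarize(text, n_sentences=2)
print(summary)
```

Sentences that mention the most frequent words ("protein", "chicken") outrank the off-topic weather sentence, which is the whole idea behind this kind of frequency-based extractive summary. Note that the result is ordered by score, not by position in the text.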
So what do we have?
When you ask “is chicken healthy?”, 4 out of the top 5 links tell you it is
healthy, whereas if you ask “is chicken unhealthy?”, all 5 top
links tell you it is unhealthy.
So be a little more mindful of your questions, even to Google. Instead of
expecting Google to give you the full picture, ask specific questions and
build the full picture yourself, or search specific websites that you
trust to give you good answers. Thanks for reading!
Reference:
[1] Usman Malik, Text Summarization with NLTK in Python
