
FEATURE: MOBILE APPS

What Do Mobile App Users Complain About?

Hammad Khalid, Shopify

Emad Shihab, Concordia University

Meiyappan Nagappan, Rochester Institute of Technology

Ahmed E. Hassan, Queen's University

// A study of user reviews from 20 iOS apps uncovered 12 types of user complaints. The most frequent complaints were functional errors, feature requests, and app crashes. However, complaints about privacy and ethical issues and hidden app costs most negatively affected ratings. //

MOBILE APPS' POPULARITY continues to grow rapidly. The Apple iOS App Store alone contained 1.2 million apps as of June 2014 and remains one of the most competitive app markets.1 This competition, and the growth of apps in critical domains such as e-commerce, government, and healthcare, has made app quality an increasingly important concern. Most recent research on app quality has focused on problems from the developers' perspective.2 However, one of the first steps in understanding the issues affecting app quality is to determine the challenges users face.

The App Store lets users review their downloaded apps. Besides assigning star ratings (all of which are aggregated and displayed on a version-level and an app-level basis), users can add comments explaining their ratings. This data source captures a unique perspective on users' perception of the apps. Such reviews, like product reviews in Web stores, correlate highly with download counts (purchases) and are a key measure of an app's success. (For more on previous research on this topic and other topics related to this article, see the sidebar.) A good understanding of these issues will help developers deal with users' concerns and avoid low ratings. Furthermore, such an understanding is crucial in guiding the software engineering community in tackling high-impact research problems in software development's fastest-growing area.

To better understand iOS users' complaints, we examined user reviews for a selection of iOS apps. We aimed to determine the types of complaints and how they affected the ratings. Our findings can enable developers to better anticipate possible complaints and prioritize their limited quality assurance (QA) resources on the complaints most important to them.

The Study's Design

Users tend to write reviews when they're extremely satisfied or extremely dissatisfied with a product.3 Poor reviews affect sales more than good reviews because buyers are more likely to react to low ratings and complaints.4 So, to understand why users give low ratings to iOS apps, we focused on reviews with one- and two-star ratings.



RELATED WORK IN CUSTOMER PERCEPTION AND MANUAL DATA ANALYSIS

User reviews have become an important source of information about customers' perceptions. Susan Mudambi and David Schuff showed that user ratings and reviews played a key role in the purchasing decisions of products from Amazon.com.1 Similarly, Hee-Woong Kim and his colleagues interviewed 30 users who bought apps and found that ratings were a key determinant in the purchase.2 Mark Harman and his colleagues mined information from 30,000 BlackBerry apps and found a strong correlation between an app's star rating and its number of downloads.3

Rajesh Vasa and his colleagues4 and Leonard Hoon and his colleagues5 analyzed user reviews of mobile apps. They found that the depth of feedback and the range of words were higher when users gave a low rating, highlighting such reviews' usefulness.

Other researchers have performed manual analysis to highlight critical information for developers. Ferdian Thung and his colleagues manually categorized bugs in machine-learning systems.6 Similarly, Yuan Tian and her colleagues manually analyzed the content of software engineering microblogs to better understand what developers microblogged about.7

Our research (see the main article) complements this previous research. We manually analyzed user reviews to identify the most frequent and impactful complaints that led to low ratings.

References
1. S.M. Mudambi and D. Schuff, "What Makes a Helpful Online Review? A Study of Customer Reviews on Amazon.com," MIS Q., vol. 34, no. 1, 2010, pp. 185-200.
2. H.-W. Kim, H.L. Lee, and J.E. Son, "An Exploratory Study on the Determinants of Smartphone App Purchase," Proc. 11th Int'l DSI and the 16th APDSI Joint Meeting, 2011.
3. M. Harman, Y. Jia, and Y. Zhang, "App Store Mining and Analysis: MSR for App Stores," Proc. 9th IEEE Working Conf. Mining Software Repositories (MSR 12), 2012, pp. 108-111.
4. R. Vasa et al., "A Preliminary Analysis of Mobile App User Reviews," Proc. 24th Australian Computer-Human Interaction Conf., 2012, pp. 241-244.
5. L. Hoon et al., "A Preliminary Analysis of Vocabulary in Mobile App User Reviews," Proc. 24th Australian Computer-Human Interaction Conf., 2012, pp. 245-248.
6. F. Thung et al., "An Empirical Study of Bugs in Machine Learning Systems," Proc. 23rd Int'l Symp. Software Reliability Eng. (ISSRE 12), 2012, pp. 271-280.
7. Y. Tian et al., "What Does Software Engineering Community Microblog About?" Proc. 9th IEEE Working Conf. Mining Software Repositories (MSR 12), 2012, pp. 247-250.

We explored two sources of information: the star ratings and their associated comments. We manually tagged the comments to uncover common complaints. Such tagging is time-consuming, so we focused on a subset of apps and tagged a statistically representative sample of their reviews. Figure 1 diagrams our study's design.

[FIGURE 1. Overview of our study process: select the apps from the app market (a high-rated app list and a low-rated app list), collect and extract their reviews, draw a statistical sample at a 95 percent confidence level, and tag the reviews, restarting whenever a new complaint type is identified. Tagging is time-consuming, so we focused on a subset of iOS apps (20) and tagged a statistically representative sample of their reviews (6,390 reviews for the 20 apps).]

Selecting Apps
We picked the 20 most popular iOS apps that were free to download, as defined by the iOS App Store during June 2012 (see Table 1). (However, some of these apps required a fee for premium features.) We ensured that the apps had at least 750 reviews so that a few users didn't skew the tagged reviews we analyzed. We also ensured that half of the apps had an overall high rating (3.5 or more stars) and that the other half had an overall low rating (less than 3.5 stars) because we wanted to identify the complaints in both the high- and low-rated apps. The apps covered 15 of the 23 categories in the iOS App Store.


TABLE 1. Statistics of the studied iOS apps.

Rating quality | App | Category | No. of stars | No. of poor reviews (1 and 2 stars) | No. of sampled reviews
High (3.5 or more stars) | Adobe Photoshop Express | Photo & Video | 3.5 | 1,030 | 280
High | CNN | News | 3.5 | 1,748 | 315
High | ESPN Score Center | Sports | 3.5 | 2,630 | 335
High | EverNote | Productivity | 3.5 | 1,760 | 315
High | Facebook | Social Networking | 4.0 | 171,618 | 383
High | Foursquare | Social Networking | 4.0 | 1,990 | 322
High | MetalStorm: Wingman | Games | 4.5 | 1,666 | 312
High | Mint Personal Finance | Finance | 4.0 | 1,975 | 322
High | Netflix | Entertainment | 3.5 | 13,403 | 373
High | Yelp | Travel | 3.5 | 2,239 | 328
Low (< 3.5 stars) | Epicurious Recipes & Shopping List | Lifestyle | 3.0 | 940 | 273
Low | FarmVille | Games | 3.0 | 10,576 | 371
Low | Find My iPhone | Utilities | 3.0 | 846 | 264
Low | Gmail | Productivity | 3.0 | 1,650 | 312
Low | Hulu Plus | Entertainment | 2.0 | 4,122 | 351
Low | Kindle | Books | 3.0 | 3,188 | 343
Low | Last.fm | Music | 3.0 | 1,418 | 302
Low | Weight Watchers Mobile | Health & Fitness | 3.0 | 1,437 | 303
Low | Wikipedia Mobile | Reference | 3.0 | 1,538 | 308
Low | Word Lens | Travel | 2.5 | 1,009 | 278

Collecting Reviews
The iOS App Store doesn't provide a public API for automatically retrieving reviews. So, we obtained the reviews from AppComments (http://appcomments.com), a Web service that collects reviews of all iOS apps. We built a Web crawler that visited each unique page with a specific iOS review. We parsed the reviews to extract data such as the app name, the review title, the rating, and the comments. We collected all the reviews for each of the apps during the first week of June 2012.
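As an illustration of this collection step, the sketch below shows the general shape such a crawler might take. It is our addition, not the authors' tool: the URL pattern and the CSS selectors for the title, rating, and comment fields are hypothetical, because the article doesn't describe AppComments' page layout.

```python
import requests
from bs4 import BeautifulSoup

def crawl_reviews(app_id, pages):
    """Fetch review pages for one app and pull out the fields used in the study.
    The URL pattern and selectors below are placeholders, not AppComments' real layout."""
    reviews = []
    for page in range(1, pages + 1):
        url = f"http://appcomments.com/app/{app_id}/reviews?page={page}"  # hypothetical
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        for node in soup.select("div.review"):                            # hypothetical selector
            reviews.append({
                "app": app_id,
                "title": node.select_one(".title").get_text(strip=True),
                "rating": int(node.select_one(".stars")["data-rating"]),
                "comment": node.select_one(".comment").get_text(strip=True),
            })
    return reviews
```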
Selecting Reviews
The 20 apps had more than 250,000 one- and two-star reviews. As we mentioned before, we studied a statistically representative sample of the reviews. To determine the sample size, we used Creative Research Systems' Sample Size Calculator (www.surveysystem.com/sscalc.htm). We randomly chose the sample to achieve a 95 percent confidence level and a 5 percent confidence interval.

Inputs: all reviews (each with a review title and comment) and a list of complaint types (initially empty)

For each review:
    Manually examine all the text in the review.
    If the review matches an existing complaint type:
        Tag the review with that type.
    Else:
        Add a new complaint type to the list of complaint types.
        Restart tagging with the new list of complaint types.

Outputs: all reviews (tagged with the appropriate complaint types) and the list of complaint types

FIGURE 2. The review-tagging procedure. This iterative process helped minimize the threat of human error during tagging.

This means we're 95 percent confident that each result is within a 5 percent margin of error.

For example, Adobe Photoshop Express had 1,030 one- and two-star reviews. The statistically representative sample for 1,030 reviews, with a 95 percent confidence level and a 5 percent confidence interval, is 280 reviews. So, we randomly selected 280 of those reviews for manual examination.

In total, we manually examined 6,390 reviews. We performed our sampling on a per-app basis because different apps have varying numbers of reviews and we wanted to capture the complaints across the different apps. The number of randomly sampled reviews for each app ranged from 264 to 383 (see the last column of Table 1).
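For readers who want to reproduce this sampling step, the following is a minimal sketch of the calculation, assuming the standard proportion-based sample-size formula with a finite-population correction that calculators such as Creative Research Systems' use. It is our illustration, not the authors' script.

```python
def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Sample size for a proportion at a given margin of error.
    z = 1.96 corresponds to a 95 percent confidence level."""
    ss = (z ** 2) * p * (1 - p) / (margin ** 2)      # infinite-population sample size
    return round(ss / (1 + (ss - 1) / population))   # finite-population correction

print(sample_size(1030))     # 280, matching the Adobe Photoshop Express sample
print(sample_size(171618))   # 383, matching the Facebook sample
```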
Tagging Reviews
To identify the complaint types, we performed coding, which turns qualitative information into quantitative data.5,6 One of us read each review to determine the type of complaint it mentioned.

Figure 2 shows how we tagged the reviews. Each time we identified a new complaint type, we went through all the previously tagged reviews to see whether to tag them with the new type. We had to restart tagging three times after discovering new types. Sometimes, a reviewer provided no meaningful comment (for example, simply saying the app was bad); we tagged such reviews as Not Specific. For reviews containing multiple complaints, we tagged them with multiple complaint types. For example, if a review mentioned a network problem and complained about the app crashing, we tagged the review with both Network Problem and App Crashing.
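The restart logic in Figure 2 amounts to a small bookkeeping loop around the human judgment. The sketch below is our illustration of that loop (the classify callback stands in for the manual decision, and the review fields are hypothetical); it is not the tooling the authors used.

```python
def tag_reviews(reviews, classify):
    """classify(review, known_types) returns the set of complaint types a human
    assigns to the review; it may contain names not yet in known_types."""
    complaint_types = []                      # starts empty and grows over time
    while True:
        tags = {}
        restarted = False
        for review in reviews:
            assigned = classify(review, complaint_types)
            new_types = [t for t in assigned if t not in complaint_types]
            if new_types:
                complaint_types.extend(new_types)
                restarted = True              # re-tag everything against the richer list
                break
            tags[review["id"]] = assigned
        if not restarted:
            return tags, complaint_types
```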
Results
We ended up with 12 complaint types (see Table 2).

The Frequency of Each Complaint Type
We calculated the frequency of the complaint types for each app. Then, we normalized the frequency (the number of complaints of a specific type divided by the total number of sampled reviews for an app) so that we could compare results across apps with varying numbers of reviews. Because each complaint type's frequency varied widely from app to app, we used the median to summarize the frequency of each complaint type across the apps.
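As a minimal sketch of this normalization and median summary (our illustration, assuming the tagged reviews are kept as per-app lists of complaint-type sets, which is a hypothetical data shape):

```python
from collections import Counter
from statistics import median

def median_frequencies(tagged_reviews_by_app):
    """tagged_reviews_by_app: {app_name: [set of complaint types per sampled review]}.
    Returns the median, across apps, of each type's normalized frequency."""
    per_app = {}
    for app, reviews in tagged_reviews_by_app.items():
        counts = Counter(t for types in reviews for t in types)
        per_app[app] = {t: n / len(reviews) for t, n in counts.items()}  # normalize per app

    all_types = {t for freqs in per_app.values() for t in freqs}
    return {t: median(freqs.get(t, 0.0) for freqs in per_app.values())  # median over the 20 apps
            for t in all_types}
```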
The first three columns of Table 3 show the complaint type, its rank, and its median percentage. Three types accounted for more than 50 percent of all complaints: Functional Error, Feature Request, and App Crashing.

To better understand Functional Error, the most frequent type, we examined the most frequently used terms in the related reviews. Then, we read through the review comments that used these terms. We found that 4.5 percent of the functional errors were about location issues and that 7.3 percent were about authentication problems. Here's an example of a functional-error review in which a user reported an authentication problem:

    Don't do the update! When I try to log in, it just keeps refreshing the screen.


TABLE 2. Identified complaint types.

Complaint type | Description | Example review
App Crashing | The app often crashed. | "Crashes immediately after starting."
Compatibility | The app had problems on a specific device or an OS version. | "I can't even see half of the app on my iPod Touch."
Feature Removal | A disliked feature degraded the user experience. | "This app would be great, but get rid of the ads!"
Feature Request | The app needed additional features. | "No way to customize alerts."
Functional Error | The problem was app specific. | "Not getting notifications unless you actually open the app."
Hidden Cost | The full user experience entailed hidden costs. | "Great if you weren't forced to buy coins for REAL money."
Interface Design | The user complained about the design, controls, or visuals. | "The design isn't sleek and isn't very intuitive."
Network Problem | The app had trouble with the network or responded slowly. | "New version can never connect to server!"
Privacy and Ethics | The app invaded privacy or was unethical. | "Yet another app that thinks your contacts are fair game."
Resource Heavy | The app consumed too much energy or memory. | "Makes GPS stay on all the time. Kills my battery."
Uninteresting Content | The specific content was unappealing. | "It looks great, but the actual gameplay is boring and weak."
Unresponsive App | The app responded slowly to input or was laggy overall. | "Bring back the old version. Scrolling lags."
Not Specific | The user's comment wasn't useful or didn't point out a problem. | "Honestly the worst app ever."

Examining Feature Request, we found that most requests were app specific. However, 6.12 percent of the requests were for better notification support.

Network Problem, Interface Design, and Feature Removal complaints were also frequent. Another complaint was Compatibility, an important issue for iOS devices; it refers to the app not working correctly on a specific device or OS version. Surprisingly, complaints about compatibility, resources, and app responsiveness weren't as frequent as we had expected.

We also examined whether the complaint types varied between the highest- and lowest-rated apps. We compared the frequency of each complaint type among the 10 highest-rated and 10 lowest-rated apps. To do this, we used a two-tailed Mann-Whitney U test at the 0.05 significance level. We didn't find any statistically significant difference between the highest-rated and lowest-rated apps.
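Such a per-type comparison can be run with SciPy's Mann-Whitney implementation. The sketch below is our illustration, and the dictionary layout of the per-app frequencies is an assumption, not the authors' code.

```python
from scipy.stats import mannwhitneyu

def compare_high_vs_low(high_rated, low_rated, alpha=0.05):
    """high_rated / low_rated: {complaint_type: [normalized frequency per app in that group]}.
    Runs a two-tailed Mann-Whitney U test per complaint type."""
    results = {}
    for ctype in high_rated:
        _, p_value = mannwhitneyu(high_rated[ctype], low_rated[ctype],
                                  alternative="two-sided")
        results[ctype] = (p_value, p_value < alpha)   # (p-value, significant at alpha?)
    return results
```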
Our findings highlight the importance of software maintenance for iOS apps because many of the frequent complaints were related directly to developmental issues (for example, Functional Error, App Crashing, and Network Problem). We believe developers can avoid such low ratings through an increased focus on QA. Also, low ratings frequently contain information that can help developers identify features users want or really hate.

The Impact of Each Complaint Type
We determined which of the most common complaints were the most negatively perceived by users. We looked at the ratio of one- to two-star ratings for each complaint type (across all apps). For example, a ratio of 5 indicated that a type had five times as many one-star ratings as two-star ratings.

The last two columns of Table 3 show the rank and ratio for each complaint type. The most negatively perceived complaints differed from the most frequent complaints.
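The ratio itself is a simple division over the tagged one- and two-star counts. A minimal sketch (our illustration, with hypothetical count dictionaries) follows.

```python
def impact_ratios(one_star_counts, two_star_counts):
    """Counts are per complaint type, aggregated across all apps.
    A ratio of 5 means five times as many one-star as two-star reviews."""
    return {ctype: one_star_counts[ctype] / two_star_counts[ctype]
            for ctype in one_star_counts}

# Sorting the result in descending order reproduces the impact ranking in Table 3.
```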

TABLE 3. The most frequent and impactful complaint types.*

Complaint type | Frequency rank | Median (%) | Impact rank | 1:2 star ratio**
Functional Error | 1 | 26.68 | 7 | 2.10
Feature Request | 2 | 15.13 | 12 | 1.28
App Crashing | 3 | 10.51 | 4 | 2.85
Network Problem | 4 | 7.39 | 6 | 2.25
Interface Design | 5 | 3.44 | 10 | 1.50
Feature Removal | 6 | 2.73 | 3 | 4.23
Hidden Cost | 7 | 1.54 | 2 | 5.63
Compatibility | 8 | 1.39 | 5 | 2.44
Privacy and Ethics | 9 | 1.19 | 1 | 8.56
Unresponsive App | 10 | 0.73 | 11 | 1.40
Uninteresting Content | 11 | 0.29 | 9 | 1.50
Resource Heavy | 12 | 0.28 | 8 | 2.00
Not Specific | n/a | 13.28 | n/a | 3.80

* All results are at the 95 percent confidence level.
** This column indicates the ratio of one- to two-star ratings across all apps.

Privacy and Ethics, Hidden Cost, and Feature Removal were the three most negatively perceived complaints (and were mostly in one-star reviews). This means that users were bothered most by issues related to privacy invasion and app developers' unethical actions (for example, unethical business practices or selling the user's personal data). To avoid such complaints, developers should access only the data (for example, the user's contacts or location) specified in the app's description.

Hidden Cost indicated users' dissatisfaction with the hidden costs needed for the full experience of an app. This complaint showed up in 15 of the apps. When an app was free to download but not free to use, the users were disappointed and often gave low ratings. For example, Hulu Plus is free to download but has a monthly subscription cost and ads in streaming videos. Because of the monthly subscription requirement, more than 55 percent of the low ratings for Hulu Plus were about the hidden costs. On closer examination, we found that the low ratings were due to the developers' poor description of the app or a misunderstanding by the user.

Developers should devote extra attention to App Crashing, Hidden Cost, and Feature Removal complaints because they're frequent and users perceive them negatively (see Table 3). Also, our study results stress the importance of developers establishing trust and expectations with app users.

Discussion
For many of the complaints, users reported they had recently updated their app. So, we wanted to study the apparent relationship between updates and complaints. We also examined the relevance of different types of complaints to software project stakeholders (for example, developers versus project managers).

Update-Related Complaints
We could know whether a complaint was update related only if the user mentioned it in the review; however, other complaints could also have been update related. Approximately 11 percent of the sampled reviews mentioned that a recent update impaired existing functionality. In 22 percent of these reviews, the users mentioned Functional Error complaints after updating.
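Because update relatedness could be established only from the review text itself, a simple keyword pass is one way to surface candidate reviews for that manual check. The sketch below is our illustration and the cue list is hypothetical; the study relied on manual reading, not on such a filter.

```python
UPDATE_CUES = ("update", "updated", "new version", "latest version")  # hypothetical cues

def mentions_update(comment):
    """Flag reviews that explicitly mention an update, as candidates for manual inspection."""
    text = comment.lower()
    return any(cue in text for cue in UPDATE_CUES)
```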


Most of these complaints were app specific. For example, one user reported a bug affecting Adobe Photoshop Express's function keys:

    Useless now: Was very useful till last update. ... Function keys no longer appear during editing.

Also, 18.8 percent of post-update reviews included requests for a new or previously removed feature. In addition, 18.2 percent of the post-update reviews complained about frequent crashing.

Developers often release free apps in hopes of eventually monetizing them by transforming free content or features to paid ones. We found that 6.8 percent of the post-update reviews complained about this hidden cost. Another important post-update complaint dealt with changes in the interface design; 6.2 percent of the post-update reviews contained such complaints.

On the basis of these findings, we recommend that developers pay special attention (for example, through regression testing and user focus groups) to features they might consider removing, to fees they might add, and to user interface changes they might introduce. Even if users have previously liked an app, a bad update could be irritating enough to make them give the app a low rating.

ABOUT THE AUTHORS

HAMMAD KHALID is a software engineer at Shopify. His research examines the link between user feedback and software quality. Khalid received a master's in computer science from Queen's University. Contact him at hammad@cs.queensu.ca; http://hammad.ca.

EMAD SHIHAB is an assistant professor in Concordia University's Department of Computer Science and Software Engineering. He's particularly interested in mining software repositories, software quality assurance, software maintenance, empirical software engineering, and software architecture. He received a Natural Sciences and Engineering Research Council of Canada Alexander Graham Bell Canada Graduate Scholarship. He has served on the program committees of the International Conference on Software Maintenance, the Working Conference on Mining Software Repositories (MSR), the International Conference on Program Comprehension, and the Working Conference on Reverse Engineering. He has also been an organizer of the MSR 2012 challenge and the MSR 2013 data showcase and a program chair for the 2013 International Workshop on Empirical Software Engineering in Practice. Contact him at emad.shihab@concordia.ca.

MEIYAPPAN NAGAPPAN is an assistant professor in the Rochester Institute of Technology's Department of Software Engineering. He previously was a postdoctoral fellow in the Software Analysis and Intelligence Lab at Queen's University. His research centers on using large-scale software engineering data to address stakeholders' concerns. Nagappan received a PhD in computer science from North Carolina State University. He received a best-paper award at the 2012 International Working Conference on Mining Software Repositories. Contact him at mei@se.rit.edu; mei-nagappan.com.

AHMED E. HASSAN is the Natural Sciences and Engineering Research Council of Canada / BlackBerry Software Engineering Chair at the School of Computing at Queen's University. His research interests include mining software repositories, empirical software engineering, load testing, and log mining. Hassan received a PhD in computer science from the University of Waterloo. He spearheaded the creation of the International Working Conference on Mining Software Repositories and its research community. Hassan also serves on the editorial boards of IEEE Transactions on Software Engineering, Empirical Software Engineering, and Computing. Contact him at ahmed@cs.queensu.ca.

Identifying Stakeholders
Because users review apps as a whole, they often raise issues that aren't directly the developers' responsibility; some complaints are directed toward product managers or other team members. To identify these stakeholders, we divided the complaints into three categories.

Development-related complaints were related directly to developers. They included App Crashing, Functional Error, Network Problem, Resource Heavy, and Unresponsive App and constituted 45.6 percent of all the complaints. So, many of the complaints were directly related to problems developers could address.

Strategic complaints primarily concerned project managers but could also partially target developers. These complaints included Feature Removal, Feature Request, Interface Design, and Compatibility and constituted 22.7 percent of all complaints. The issues related to these complaints required greater knowledge of the project and priorities and usually didn't have a straightforward solution.

Content complaints concerned the content or value of the app itself; developers had little or no control over the issues related to these complaints. They included Privacy and Ethics, Hidden Cost, and Uninteresting Content. Addressing these complaints would require rethinking the app's core strategy (the business model or the content offered). Although these complaints accounted for only 3.02 percent of all complaints, Privacy and Ethics and Hidden Cost had the most negative impact, as we mentioned before.
these complaints required greater tion: Making Use of a Decade of Widely

U
Varying Historical Data, Proc. 2nd
knowledge of the project and pri- ACM-IEEE Intl Symp. Empirical Soft-
orities and usually didnt have a ser reviews strongly af- ware Eng. and Measurement, 2008, pp.
149157.
straightforward solution. fect developers and orga-
6. C.B. Seaman, Qualitative Methods in
Content complaints concerned nizations that develop iOS Empirical Studies of Software Engineer-
the content or value of the app it- apps. Low ratings negatively reflect ing, IEEE Trans. Software Eng., vol. 25,
no. 4, 1999, pp. 557572.
selfdevelopers had little or no on their apps quality, thus affect-
control over the issues related to ing the apps popularity and even-
these complaints. These complaints tually their revenues. To compete in
included Privacy and Ethics, Hid- an increasingly competitive market,
den Cost, and Uninteresting Con- developers must understand and ad-
tent. Addressing these complaints dress their users concerns.
would require rethinking the apps Our fi ndings point to new soft-
core strategy (the business model or ware engineering research avenues,
the content offered). Although these such as how ethics, privacy, and
complaints accounted for only 3.02 user-perceived quality affect mo-
percent of all complaints, Privacy bile apps. We plan to expand on this
and Ethics and Hidden Cost had the study by considering more apps and Selected CS articles and columns
most negative impact, as we men- comparing our fi ndings across other are also available for free at
http://ComputingNow.computer.org.
tioned before. mobile platforms.
