Investigation of approaches for sending statistical questionnaires by SMS

Issoufou Seidou Sanda

I. Motivation
Mobile technologies offer an opportunity to improve the effectiveness and efficiency with which
economic value is generated in many areas. One of these areas is statistical data collection,
which involves costly operations such as censuses and surveys. Mobile technologies have been used
in statistical operations many times, with varying degrees of success (Brown, Vanable and Eriksen,
2008; Couper, 2005; Reades, Calabrese, Sevtsuk and Ratti, 2007; Stopher, 2009; Vijayaraj and
DineshKumar, 2010). One such method is the Computer Assisted Personal Interview (CAPI), which has
been used for the last two decades and for which various levels of success have been reported
(Gravlee, 2002; Baker, Bradburn and Johnson, 1995). One of the main weaknesses of all the methods
proposed so far is the cost of the device that must be given to each enumerator. Once the cost of
one device is multiplied by the number of enumerators, the total cost of the survey remains high,
which reduces the comparative advantage over traditional survey methods. However, ECA management
has suggested another way of using mobile technologies for statistical operations that would solve
the device cost problem: design approaches that use the mobile devices already owned by the
enumerators, so that no device has to be distributed. In line with this idea, this paper
investigates the possibility of sending statistical questionnaires using regular SMS. SMS was
chosen because it is available on every mobile phone and the cost of sending an SMS message is
relatively low.

II. Shannon information theory and application to sending statistical questionnaires by SMS
According to Shannon information theory (Shannon, 2001), the information capacity of a
transmission channel can be quantified using the following formula:

$$H = \log S^{n} = n \log S$$

where S is the number of possible symbols that can be used for the communication, and n is the
number of symbols in the transmission. In the case of an SMS message, let us call nb_char the
number of characters we can use in the SMS message and sms_length the maximum length of the SMS
message. The capacity of the SMS message is then given by the formula:

$$\mathit{sms\_capacity} = \mathit{sms\_length} \cdot \log(\mathit{nb\_char})$$


For example, using 7-bit ASCII characters, an SMS can carry up to 160 characters, and we use an
alphabet of 85 of the available characters to code our questionnaire. The capacity of
transmission of the SMS message in this case is:

$$\mathit{sms\_capacity} = 160 \log(85)$$


On the other hand, if we consider a statistical questionnaire and assume that the answers to each
question are uniformly distributed, we can calculate the maximum information content of the
questionnaire the following way:

For one question with n possible answers, the maximum information content is:

$$1 \cdot \log(n) = \log(n)$$

For q questions with n possible answers each, the maximum information content is:

$$q \log(n)$$

For a general questionnaire with p questions indexed by i varying from 1 to p, and n_i possible
answers for question i, the maximum information content is:

$$\sum_{i=1}^{p} \log(n_i) = \log\Big(\prod_{i=1}^{p} n_i\Big)$$

For q identical questionnaires, the maximum information content is:

$$q \sum_{i=1}^{p} \log(n_i) = q \log\Big(\prod_{i=1}^{p} n_i\Big)$$

Therefore, in direct application of Shannon information theory, the number q of statistical
questionnaires we should be able to send with a single SMS is obtained by solving the inequality:

$$q \log\Big(\prod_{i=1}^{p} n_i\Big) \le \mathit{sms\_length} \cdot \log(\mathit{nb\_char})$$

Which gives:

$$q = \operatorname{int}\left(\frac{\mathit{sms\_length} \cdot \log(\mathit{nb\_char})}{\log\big(\prod_{i=1}^{p} n_i\big)}\right)$$

int being the integer part.

If we take a questionnaire of an average of 20 questions with an average of 20 options each, and
assuming that we can send up to 160 characters in an SMS drawn from an alphabet of 85 characters,
the number of questionnaires we can send using a single SMS is:

$$q = \operatorname{int}\left(\frac{160 \log(85)}{\log(20^{20})}\right) = \operatorname{int}(11.86) = 11$$

This means that it is possible to send up to 11 questionnaires in a single SMS message using the
right compression algorithm. But this is just a theoretical limit, and we need to investigate
feasible practical implementations.
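
The arithmetic of this section can be checked with a few lines of Python. This is only a minimal sketch: the function names `sms_capacity_bits` and `questionnaires_per_sms` are illustrative and do not come from the paper, and the defaults (160 characters, 85 usable symbols) are the values assumed in the text.

```python
import math


def sms_capacity_bits(sms_length=160, nb_char=85):
    """Capacity of one SMS in bits: sms_length * log2(nb_char)."""
    return sms_length * math.log2(nb_char)


def questionnaires_per_sms(options_per_question, sms_length=160, nb_char=85):
    """Integer part of sms_length * log(nb_char) / log(prod(n_i)).

    options_per_question lists n_i, the number of possible answers
    of each question of one questionnaire.
    """
    # log(prod(n_i)) is computed as sum(log(n_i)) to avoid huge products.
    questionnaire_info = sum(math.log(n) for n in options_per_question)
    return int(sms_length * math.log(nb_char) / questionnaire_info)


# 20 questions with 20 options each, as in the text.
print(round(sms_capacity_bits(), 1))        # about 1025.5 bits
print(questionnaires_per_sms([20] * 20))    # 11
```

The base of the logarithm cancels in the ratio, so natural logarithms are used in the second function and base 2 only where a value in bits is wanted.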

III. Proposed practical implementations


III.1) A quasi-optimal approach: reorganizing the vector space
From this point, we consider that the answers to a questionnaire (or to the Q combined
questionnaires) are represented by a single point in a vector space of dimension q, q being the
number of questions. The possible values that the answer can take are contained in a hypercube of
dimensions $n_1 \times n_2 \times \dots \times n_q$, n_i being the number of possible answers for
question i. We also assume a maximum length of 160 characters for an SMS and 85 possible
characters in the SMS. The idea is to reorganize the hypercube of dimensions
$n_1 \times n_2 \times \dots \times n_q$ into a hypercube of dimensions
$85 \times 85 \times \dots \times 85$ (m times). In this case the coordinate of the questionnaire
(or combined questionnaires) on each dimension is represented by a single character, and the
answer is converted into a string of m characters that can be sent by SMS. Figure 1 below gives an
idea of how this transformation is done in two dimensions.
Using this method we obtain a reversible function that reduces the q dimensions of the answer set
to m dimensions, with one character corresponding to the coordinate on each of the m dimensions,
which immediately gives the character string to be sent by SMS.
The drawback of this approach is that it requires relatively complex calculations on both the
coding and the decoding side. Therefore, for the coding to be done correctly by the enumerator, a
special coding program has to be installed on her or his mobile phone.
We are now going to investigate less optimal approaches that do not require any special equipment
beyond the mobile phone and good training of the enumerator.

Figure 1: Converting a hypercube with dimensions greater than 85 to a hypercube with all dimensions lower than 85
in a reversible manner (two-dimensional case).
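
One way to realize the reversible mapping of III.1 is a mixed-radix conversion: the answers are first folded into a single integer, which is then written in base 85. The sketch below is an assumption about how such a mapping could be implemented, not the program used by the author. The names `ALPHABET`, `string_length`, `encode_answers` and `decode_answers` are illustrative, answers are taken as zero-based option indices, and the alphabet follows codes 0 to 84 of Table 2.

```python
import math

# 85 symbols in the order of Table 2 (codes 0 to 84).
ALPHABET = (
    "0123456789"
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "!\"#$%&'()*+,-./:;<=>?@{"
)
BASE = len(ALPHABET)  # 85


def string_length(sizes):
    """Smallest m such that 85**m covers all prod(sizes) answer combinations."""
    total = math.prod(sizes)
    m = 0
    while BASE ** m < total:
        m += 1
    return m


def encode_answers(answers, sizes):
    """Map zero-based answers (answers[i] < sizes[i]) to a string of m characters."""
    index = 0
    for answer, n in zip(answers, sizes):
        if not 0 <= answer < n:
            raise ValueError("answer out of range")
        index = index * n + answer          # mixed-radix position in the hypercube
    chars = []
    for _ in range(string_length(sizes)):
        index, digit = divmod(index, BASE)  # coordinate on one of the m dimensions
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def decode_answers(text, sizes):
    """Inverse of encode_answers."""
    index = 0
    for ch in text:
        index = index * BASE + ALPHABET.index(ch)
    answers = []
    for n in reversed(sizes):
        index, answer = divmod(index, n)
        answers.append(answer)
    return answers[::-1]


sizes = [2000, 20, 3, 120]    # hypothetical numbers of options per question
answers = [210, 2, 1, 92]     # zero-based answers
coded = encode_answers(answers, sizes)
print(len(coded), decode_answers(coded, sizes) == answers)
```

With these four hypothetical questions the whole answer fits in 4 characters, whereas coding each question separately would take 6.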

III.2) Sub-optimal approaches that are still acceptable


We are looking here at approaches that do not require any special infrastructure or program on the
enumerator's side: the enumerator only has to produce the coded string using the indications on
the questionnaire and a little training. In this case, we are really just using the device that
the enumerator already has, without even needing to install any special program on it.
A) Coding question by question.
In this case, the options of each question are enumerated and numbered, then the number of the
option is converted to one or two characters by a change of mathematical base. We are using the
following codes (table 2), which correspond to a modification of the table of 7-bit ASCII
characters (given in annex 1).
This table directly gives the single character that codes any option number below 85. Option
numbers of 85 or more need more than one character. The table below shows how to convert some
numbers greater than or equal to 85 using a change of base from base 10 to base 85.
Number   Coded string   Number   Coded string   Number   Coded string
85       10             170      20             1275     f0
86       11             171      21             1276     f1
87       12             172      22             1277     f2
88       13             173      23             1278     f3
89       14             174      24             1279     f4
90       15             175      25             1280     f5
91       16             176      26             1281     f6
161      1/             246      2/             1351     f/
162      1:             247      2:             1352     f:
163      1;             248      2;             1353     f;
164      1<             249      2<             1354     f<
165      1=             250      2=             1355     f=
166      1>             251      2>             1356     f>
167      1?             252      2?             1357     f?
168      1@             253      2@             1358     f@
169      1{             254      2{             1359     f{

Table 1: Coding a few numbers into two-character strings


Questions with more than 85 x 85 = 7225 options, which would require three or more characters,
are unlikely. Any questionnaire of q questions can therefore be represented by a string of at
most 2q characters.

New code   Character   Description
0          0           Zero
1          1           One
2          2           Two
3          3           Three
4          4           Four
5          5           Five
6          6           Six
7          7           Seven
8          8           Eight
9          9           Nine
10         a           Lowercase a
11         b           Lowercase b
12         c           Lowercase c
13         d           Lowercase d
14         e           Lowercase e
15         f           Lowercase f
16         g           Lowercase g
17         h           Lowercase h
18         i           Lowercase i
19         j           Lowercase j
20         k           Lowercase k
21         l           Lowercase l
22         m           Lowercase m
23         n           Lowercase n
24         o           Lowercase o
25         p           Lowercase p
26         q           Lowercase q
27         r           Lowercase r
28         s           Lowercase s
29         t           Lowercase t
30         u           Lowercase u
31         v           Lowercase v
32         w           Lowercase w
33         x           Lowercase x
34         y           Lowercase y
35         z           Lowercase z
36         A           Uppercase A
37         B           Uppercase B
38         C           Uppercase C
39         D           Uppercase D
40         E           Uppercase E
41         F           Uppercase F
42         G           Uppercase G
43         H           Uppercase H
44         I           Uppercase I
45         J           Uppercase J
46         K           Uppercase K
47         L           Uppercase L
48         M           Uppercase M
49         N           Uppercase N
50         O           Uppercase O
51         P           Uppercase P
52         Q           Uppercase Q
53         R           Uppercase R
54         S           Uppercase S
55         T           Uppercase T
56         U           Uppercase U
57         V           Uppercase V
58         W           Uppercase W
59         X           Uppercase X
60         Y           Uppercase Y
61         Z           Uppercase Z
62         !           Exclamation mark
63         "           Double quotes (or speech marks)
64         #           Number sign
65         $           Dollar
66         %           Percent sign
67         &           Ampersand
68         '           Single quote
69         (           Open parenthesis (or open bracket)
70         )           Close parenthesis (or close bracket)
71         *           Asterisk
72         +           Plus
73         ,           Comma
74         -           Hyphen
75         .           Period, dot or full stop
76         /           Slash or divide
77         :           Colon
78         ;           Semicolon
79         <           Less than (or open angled bracket)
80         =           Equals
81         >           Greater than (or close angled bracket)
82         ?           Question mark
83         @           At symbol
84         {           Opening brace
85         }           Closing brace

Table 2: Codification of numbers from 0 to 85 (modified ASCII table)

To facilitate the conversion by the enumerator, the one or two characters are printed in front of
each option of each question on the questionnaire. To generate the character string to send by
SMS, the enumerator only needs to concatenate these characters in the order of the questions. The
string is decoded at the receiving end, where basic checks can be run immediately and an SMS sent
back to the enumerator if an error is found.
Simple example with a questionnaire of a few questions:
Identification of the household:
Possible answers: 1 to 2000.
Current answer: 211.
Corresponding string: 2F.
Identification of the individual in the household:
Possible answers: 1 to 20.
Current answer: 3.
Corresponding string: 3.
Sex:
Possible answers: 1 or 2 or 3 (for n.a.)
Current answer: 2.
Corresponding string: 2.
Age:
Possible answers: 1 to 120.
Current answer: 93.
Corresponding string: 18.
Education level:
A few possible answers (ISCED), with the corresponding strings in parentheses:
24 Lower secondary education:
241 Insufficient for level completion or partial level completion, without direct access to upper
secondary education (2*)
242 Sufficient for partial level completion, without direct access to upper secondary education
(2+)
243 Sufficient for level completion, without direct access to upper secondary education (2,)
244 Sufficient for level completion, with direct access to upper secondary education (2-)
25 Lower secondary vocational education:
251 Insufficient for level completion or partial level completion, without direct access to upper
secondary education (2>)
252 Sufficient for partial level completion, without direct access to upper secondary education
(2?)
253 Sufficient for level completion, without direct access to upper secondary education (2@)
254 Sufficient for level completion, with direct access to upper secondary education (2{)
Current answer: 253.
Corresponding string: 2@.
Character string to send based on these answers: 2F32182@
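
The question-by-question coding can be sketched in Python as a plain base-10 to base-85 conversion with the alphabet of Table 2 (codes 0 to 84). The names `to_base85`, `encode_questionnaire` and `decode_questionnaire` are illustrative, not taken from the paper. One assumption is added here so that the receiving side can split the string again: each question has a fixed number of characters, chosen from its largest possible answer, and shorter codes are padded with a leading "0".

```python
# 85 symbols in the order of Table 2 (codes 0 to 84).
ALPHABET = (
    "0123456789"
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "!\"#$%&'()*+,-./:;<=>?@{"
)
BASE = len(ALPHABET)  # 85


def to_base85(number, width):
    """Write number in base 85 on exactly `width` characters."""
    if not 0 <= number < BASE ** width:
        raise ValueError("number does not fit in the given width")
    chars = []
    for _ in range(width):
        number, digit = divmod(number, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def encode_questionnaire(answers, widths):
    """Concatenate the coded answers in the order of the questions."""
    return "".join(to_base85(a, w) for a, w in zip(answers, widths))


def decode_questionnaire(message, widths):
    """Split the message by the fixed widths and convert back to numbers."""
    answers, pos = [], 0
    for width in widths:
        value = 0
        for ch in message[pos:pos + width]:
            value = value * BASE + ALPHABET.index(ch)
        answers.append(value)
        pos += width
    return answers


# Household, individual, sex, age, education level.
answers = [211, 3, 2, 93, 253]
widths = [2, 1, 1, 2, 2]      # assumed fixed widths, one entry per question
message = encode_questionnaire(answers, widths)
print(message)
print(decode_questionnaire(message, widths))
```

Fixed widths cost an occasional padding character (an age of 30 would be sent as "0u" rather than "u"), but they let the decoder recover every answer without separators.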

B) Combining questions with few options
As already indicated, the method in A) is sub-optimal, as it does not reach compression levels
close to the one given by the theoretical formula derived from Shannon's theorem. One possible way
of reducing the length of the character string is to combine several questions with few options
into one question with many options. This reduces the space wasted in the information capacity of
the SMS.
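
As an illustration of B), the sketch below groups consecutive questions as long as the product of their numbers of options stays within 85, so that each group still fits in one character. The greedy grouping rule, the function names and the list of option counts are assumptions made for the example; the paper does not prescribe a particular grouping strategy.

```python
import math

BASE = 85


def chars_needed(n_options):
    """Number of base-85 characters needed to code n_options distinct values."""
    width = 1
    while BASE ** width < n_options:
        width += 1
    return width


def group_questions(sizes):
    """Greedily merge consecutive questions while the product of options is <= 85."""
    groups, current, product = [], [], 1
    for n in sizes:
        if current and product * n > BASE:
            groups.append(current)
            current, product = [], 1
        current.append(n)
        product *= n
    if current:
        groups.append(current)
    return groups


sizes = [3, 20, 2, 5, 4, 2, 120]   # hypothetical numbers of options per question
groups = group_questions(sizes)
before = sum(chars_needed(n) for n in sizes)
after = sum(chars_needed(math.prod(g)) for g in groups)
print(groups)          # [[3, 20], [2, 5, 4, 2], [120]]
print(before, after)   # 8 4
```

Here seven questions that would take 8 characters when coded one by one are sent in 4 characters once the small questions are merged.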
C) Using the redundancies in the answers
We have assumed uniform distributions so far, but this is not the case in the real world. Some
answers will always have higher probabilities than others. This can be used as an opportunity to
reduce the length of the string to be sent. For example, we can use a reference string that
indicates the most likely answers and indicate in subsequent messages only those variables that
differ from the reference. Other methods should be the subject of further research.
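
The reference-string idea in C) can be sketched as follows. This is a hedged illustration only: the paper does not define a message layout, so the representation of a message as a list of (question index, answer) pairs, the function names and the sample values are all assumptions.

```python
def delta_encode(reference, answers):
    """Keep only the questions whose answer differs from the reference."""
    return [(i, a) for i, (r, a) in enumerate(zip(reference, answers)) if r != a]


def delta_decode(reference, changes):
    """Rebuild the full list of answers from the reference and the changes."""
    answers = list(reference)
    for i, a in changes:
        answers[i] = a
    return answers


reference = [1, 2, 1, 1, 3, 1]    # hypothetical most likely answers
answers = [1, 2, 2, 1, 3, 4]      # answers of one respondent
changes = delta_encode(reference, answers)
print(changes)                                      # [(2, 2), (5, 4)]
print(delta_decode(reference, changes) == answers)  # True
```

Each pair still has to be coded into characters, for instance with the base-85 codes of Table 2, and the saving only materializes when most answers match the reference.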

III.3) Implementation using an electronic questionnaire


In order to reduce the number of errors during entry, it may be necessary to use an electronic
device that can run consistency checks during data entry, encode the data and, if necessary,
encrypt them to ensure the confidentiality of the data as required by the statistical law. One
option is to develop a program to be installed on the mobile phone of the enumerator. Another
option is to develop a low-cost electronic device made specifically for the task. In this case,
the second option was tested using a $3.5 educational microcontroller (Picaxe 20M2) and an
electronic circuit simulator (Proteus VSM). The circuit, shown in the figures below, is fully
functional and shows that it is possible to develop data entry devices for less than $10 that can
be used in censuses and surveys instead of expensive PDAs.
In this specific case, the electronic questionnaire developed was used for entering the answers to
the questions and encoding them using a very simple mechanism.

Figure 2: The simulated electronic questionnaire with data entered using the following pattern: #Question
number*Selected option#Question number*Selected option

Figure 3: The simulated electronic questionnaire working on the encoding of the data.

Figure 4: String generated by the encoding process
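
The entry pattern shown in Figure 2 (#Question number*Selected option, repeated) is simple enough to parse in a few lines. The sketch below is written in Python for illustration only; it is not the code that drives the microcontroller, the function name `parse_entries` is hypothetical, and the assumption that both fields are plain decimal numbers is ours.

```python
def parse_entries(keyed):
    """Parse '#question*option#question*option...' into {question: option}."""
    entries = {}
    for field in keyed.split("#"):
        if not field:
            continue  # skip the empty piece before the first '#'
        question, sep, option = field.partition("*")
        if not sep or not question.isdigit() or not option.isdigit():
            raise ValueError(f"malformed entry: {field!r}")
        entries[int(question)] = int(option)
    return entries


print(parse_entries("#1*211#2*3#3*2"))   # {1: 211, 2: 3, 3: 2}
```

Rejecting malformed fields at this stage is one place where the consistency checks mentioned above could start.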

This questionnaire has only the basic encoding functionality, but it can be extended with an SMS
sending module to send the coded string directly without using another mobile device, a mechanism
to transfer the coded string directly to the mobile phone of the enumerator (cable or Bluetooth),
additional memory to store the manuals and dictionaries for consultation by the enumerator, and so
on. The possibilities are limited only by the cost of the additional modules. The electronic
circuit and the code that drives the microcontroller are available on request.
IV. Conclusions
Sending structured information by SMS has many potential applications, not only for surveys but
also for administrative data collection, civil registration (registering births by SMS), etc. It
is a very cost-effective approach because it does not require any special infrastructure on the
side of the person collecting the data. It allows automatic checking of the answers in a
centralized way, and the enumerator can be informed immediately by SMS of errors to correct. It
can be a very efficient way of leveraging mobile technologies for statistical operations in
developing countries. What is needed is to develop the programs at the receiving end that can
receive and decode the strings, check the answers and send notifications to the enumerator when
errors are detected.

V. Short bibliography
Baker, Reginald P., Norman M. Bradburn, and A. Johnson. "Computer-assisted personal
interviewing: An experimental evaluation of data quality and costs." Journal of Official
Statistics 11 (1995): 415-434.
Brown, Jennifer L., Peter A. Vanable, and Michael D. Eriksen. "Computer-assisted self-interviews:
A cost effectiveness analysis." Behavior Research Methods 40.1 (2008): 1-7.
Couper, Mick P. "Technology trends in survey data collection." Social Science Computer Review
23.4 (2005): 486-501.
Gravlee, Clarence C. "Mobile computer-assisted personal interviewing with handheld computers:
The Entryware System 3.0." Field Methods 14.3 (2002): 322-336.
Reades, J., Calabrese, F., Sevtsuk, A. and Ratti, C. "Cellular census: Explorations in urban data
collection." IEEE Pervasive Computing 6.3 (2007): 30-38.
Shannon, Claude Elwood. "A mathematical theory of communication." ACM SIGMOBILE Mobile
Computing and Communications Review 5.1 (2001): 3-55.
Stopher, Peter R. "Collecting and processing data from mobile technologies." Bonnel, P.;
Lee-Gosselin, M.; Zmud, J. (2009): 361-391.
Vijayaraj, A., and P. DineshKumar. "Design and Implementation of Census Data Collection System
Using PDA." International Journal of Computer Applications 9.9 (2010).

Annex 1: ASCII 7-bit printable characters

No    Symbol    Description
32    (space)   Space
33    !         Exclamation mark
34    "         Double quotes (or speech marks)
35    #         Number sign
36    $         Dollar
37    %         Percent sign
38    &         Ampersand
39    '         Single quote
40    (         Open parenthesis (or open bracket)
41    )         Close parenthesis (or close bracket)
42    *         Asterisk
43    +         Plus
44    ,         Comma
45    -         Hyphen
46    .         Period, dot or full stop
47    /         Slash or divide
48    0         Zero
49    1         One
50    2         Two
51    3         Three
52    4         Four
53    5         Five
54    6         Six
55    7         Seven
56    8         Eight
57    9         Nine
58    :         Colon
59    ;         Semicolon
60    <         Less than (or open angled bracket)
61    =         Equals
62    >         Greater than (or close angled bracket)
63    ?         Question mark
64    @         At symbol
65    A         Uppercase A
66    B         Uppercase B
67    C         Uppercase C
68    D         Uppercase D
69    E         Uppercase E
70    F         Uppercase F
71    G         Uppercase G
72    H         Uppercase H
73    I         Uppercase I
74    J         Uppercase J
75    K         Uppercase K
76    L         Uppercase L
77    M         Uppercase M
78    N         Uppercase N
79    O         Uppercase O
80    P         Uppercase P
81    Q         Uppercase Q
82    R         Uppercase R
83    S         Uppercase S
84    T         Uppercase T
85    U         Uppercase U
86    V         Uppercase V
87    W         Uppercase W
88    X         Uppercase X
89    Y         Uppercase Y
90    Z         Uppercase Z
91    [         Opening bracket
92    \         Backslash
93    ]         Closing bracket
94    ^         Caret - circumflex
95    _         Underscore
96    `         Grave accent
97    a         Lowercase a
98    b         Lowercase b
99    c         Lowercase c
100   d         Lowercase d
101   e         Lowercase e
102   f         Lowercase f
103   g         Lowercase g
104   h         Lowercase h
105   i         Lowercase i
106   j         Lowercase j
107   k         Lowercase k
108   l         Lowercase l
109   m         Lowercase m
110   n         Lowercase n
111   o         Lowercase o
112   p         Lowercase p
113   q         Lowercase q
114   r         Lowercase r
115   s         Lowercase s
116   t         Lowercase t
117   u         Lowercase u
118   v         Lowercase v
119   w         Lowercase w
120   x         Lowercase x
121   y         Lowercase y
122   z         Lowercase z
123   {         Opening brace
124   |         Vertical bar
125   }         Closing brace
126   ~         Equivalency sign - tilde
127             Delete

Source : http://www.ascii-code.com/
