
Big Data−An Evolving Concern for Forensic Investigators

Conference Paper · November 2015


DOI: 10.1109/Anti-Cybercrime.2015.7351932



All content following this page was uploaded by Shahzaib Tahir on 07 March 2016.



Big Data−An Evolving Concern for Forensic
Investigators

Shahzaib Tahir, Waseem Iqbal


Department of Information Security, College of Signals
National University of Sciences and Technology (NUST)
Islamabad, Pakistan
shahzaib.msis12@students.mcs.edu.pk; waseem.iqbal@mcs.edu.pk

Abstract: Big Data is a term associated with large datasets that come into existence through the volume, velocity and variety of data. Ever-increasing human dependence on computers and automated systems has caused data to grow massively. This substantial collection of data is helpful not only to researchers but equally to investigators who carry out forensic analysis of data associated with criminal cases. The conventional methodologies of forensic analysis have changed with the emergence of big data, because big data forensics requires more sophisticated tools along with the deployment of efficient frameworks. Until now several techniques have been devised to help the forensic analysis of small datasets, but none of them has been studied in combination with big data. Hence in this paper different techniques are studied by closely analyzing their feasibility for the extraction and forensic analysis of evidence from large amounts of data. We discuss various sources of data and how techniques such as the MapReduce framework and phylogenetic trees can help a forensic investigator visualize large datasets in order to conduct a forensic analysis. Since audio and video are an attractive source of forensic data, this paper also discusses recent techniques that assist in the extraction of useful sound signals from noise-infested audio signals. Similar techniques for the forensic analysis of images are also presented. Based on interviews conducted with forensic professionals, the factors affecting big data forensic techniques, along with their severity, have been identified so that a scenario-specific approach can be adopted based on the available investigative resources.

Keywords: Big Data; Forensics; Phylogenetic Trees; Digital Forensics; MapReduce; Hadoop Distributed File System; Blind Source Separation; Image Culling.

I. INTRODUCTION

As humans become increasingly reliant on computers, incidents involving computer-based crime have also risen. Cyber crime is a term that refers to the use of computers to carry out a crime. Owing to its digital nature, cyber crime cannot be investigated using conventional investigative techniques and requires sophisticated software to conduct the investigation. Digital forensics can be defined in a number of ways, but a concise definition describes it as the scientific methodologies or steps taken towards the collection, identification, analysis and documentation of evidence that is derived from a digital source and can be presented in court when required [2]. Hence digital forensics is the investigation carried out to establish whether a person is guilty or innocent of committing a cyber crime. With this in mind, the integrity of the digital data is highly critical during the investigation.

Advancements in the field of digital forensics have resulted in the development of methodologies that assist in carrying out forensic activities. The information age has resulted in the generation of huge amounts of data that can serve as evidence, a clue, a fact or an indication of what happened. Acknowledgement of the importance of data and of the linkages between data has given rise to the term “Big Data”. Big data has forced the development of techniques and tools that are applicable to datasets so large and complex that conventional data processing techniques cannot be applied to them. These techniques should be user friendly, highly interactive and aesthetically pleasing to facilitate the process of investigation.

According to a recent survey conducted by the American Institute of CPAs, “Big Data is listed as the top issue facing forensic and valuation professionals in next two to five years” [1]. Over the past decade extensive research has been done to improve digital forensic techniques in order to make the task of the forensic investigator easier. The scope of investigations was limited to a workstation, an office or an organization, so the tools being developed were not applicable to large datasets. The present era requires digital forensic investigations in the corporate sector, multinational companies and large-scale data centres. Currently, extensive research is being done on big data analysis so that the relationships among the data can be uncovered effectively and efficiently. Many researchers have proposed different techniques, each with its own advantages and disadvantages. The focus has often remained on the use of trees that help in visualizing large datasets and revealing the relationships among them [3][4]. The internet can be described as a sea of data, and this sea is becoming deeper with every passing second. Hence it is not wrong to associate the term big data with the data residing on the internet. For the analysis of data on the internet or on an Apache server, the Hadoop Distributed File System is used, which employs the MapReduce framework [5]. In digital forensics the evidence can sometimes be audio. The extraction of desired signals from a large set or mixture of signals requires a technique known as

978-1-4799-7620-1/15/$31.00 ©2015 IEEE


Blind Source Separation [6]. All of these techniques are used in the analysis of data, but none of them has been discussed as a means of helping the forensic investigator perform big data forensics. Hence in this paper we discuss some existing techniques and some novel techniques. This paper also focuses on the tools and software required to carry out these investigations.

In this paper we have analyzed different techniques that assist in the extraction of probable evidence from innocent information. We also propose a novel technique that can be used to visualize large datasets using phylogenetic trees and graphs. Our proposed technique makes it convenient for an investigative analyst to understand and visualize data that is already stored in a database, whether that data resides on the internet or on a local system. In Section II we discuss big data in detail and list the different sources of big data. Section III is divided into several subsections and discusses different methods for dealing with big data forensics: the MapReduce framework, phylogenetic trees/graphs, Blind Source Separation and image culling. The applicability of each technique is also discussed. Section IV identifies the factors on which the choice of forensic technique should be based. Finally, a conclusion about the techniques discussed and proposed in this paper is provided, together with their future aspects.

II. BIG DATA AND SOURCES

Big data is a collection of datasets so large and complex that the data cannot be analyzed without sophisticated software and applications. The term Big Data came into existence with the emergence of the “Three V’s” associated with data [9], i.e. the increased volume of data, the velocity at which data emerges and the variety of data. Different researchers have listed several challenges associated with big data [3][7][8][9]:

• Capture: the collection of the relevant data from different data sources and taking the evidence into custody so that the relevant data can be extracted, processed and analyzed.

• Curation: the acquisition and care of data. Curation covers data collection, documentation, research/analysis of the collected data and sharing the data with the public. Curation also requires maintaining a chain of custody form/evidence form to keep track of every process and of the evidence collected.

• Storage: the analysis of data to uncover the characteristics of, and relationships among, the data so that the data can be stored effectively and efficiently.

• Search: the retrieval of the desired data from large datasets. This requires proper scripts along with highly professional skills to carry out the search.

• Analysis: the reduced datasets need to be closely analyzed to uncover the underlying relationships among the data. This step is also referred to as preprocessing of the data and can help facilitate its visualization.

• Visualization: helps users to view and closely analyze the data in an aesthetically pleasing manner. In some cases it also provides a graphical user interface that lets the user interact directly with the system.

All of the above challenges, if dealt with properly, can help in accurate forensic investigations. Identification of data sources is a crucial part of big data forensics. Data that has not been collected or analyzed can result in imprecise investigations and can end in non-indictment due to inconclusive evidence. Current trends in computing dictate that an exhaustive investigation be launched because of changing forms of media and data sources. For instance, social media accounts, network profiles, network connections and device connection histories are some of the sources that must be considered for a full data collection activity. Several other sources of big data can also be a source of digital evidence, including:

• Sensor Data
• CCTV Footage
• Images
• Email
• Social Media
• Intrusion Detection Data
• HTML
• Location etc.

III. BIG DATA FORENSIC TECHNIQUES

A formal forensic investigation cannot be launched until meaningful and relevant data has been extracted from the entire dataset [17]. In this section we focus on the different techniques that can help the forensic investigator analyze big data to find the underlying relationships among the data. Furthermore, these techniques will help the investigator extract meaningful and purposeful forensic evidence from large datasets.

A. MapReduce Framework

Apache Hadoop is one of the most widely used and most popular platforms for big data processing. Hadoop is an open source software framework governed by the Apache Software Foundation (ASF). Hadoop allows the analysis of large quantities of structured and unstructured data effectively and efficiently [10]. Hadoop has gained popularity over the past several years owing to its useful feature set, its scalable nature and its lack of system dependencies. Hadoop is being used for performing textual searches and analysis of large volumes of data. Although Hadoop deals with the massive amounts of data residing on the internet, mainly on Apache servers, this technique unfortunately isn’t widely used by
forensic investigators to carry out big data investigations. The reason is that this technique still needs to be explored in association with big data forensics. Hence we aim to shed light on this technique while coupling it with big data.

Hadoop is based on two very important components, HDFS (Hadoop Distributed File System) and MapReduce [11]. What makes HDFS different from other distributed file systems is that in HDFS thousands of servers are connected with each other. The interconnectivity of servers helps mitigate the risks associated with a failure or error on any one server and as a result provides a system that is highly fault tolerant. HDFS is extremely suitable for applications dealing with large datasets as it provides high-throughput access to the data. HDFS provides three utilities: distribution, replication and automatic recovery of the data [11]. All these utilities are administered by the MapReduce framework. First the data is distributed across a distributed file system. Each word is separately assigned a key; this step is termed “Map” in the framework. The key pairs are then combined so that the number of outcomes is reduced to the desired data; the resulting dataset is termed the useful data, and in the framework this step is called “Reduce”. MapReduce uses filtering and sorting algorithms based on distributed databases to provide a highly scalable solution to data extraction. Figure 1 shows the MapReduce framework in detail. The data is divided and distributed among the data structures D1 to Dn. This is the mapping phase. The individual datasets are then assigned unique keys that help reduce the data, depending upon the algorithm used. Based on these keys the result sets are then obtained.

Fig. 1. MapReduce Framework

The MapReduce framework used by Hadoop makes forensic analysis of the data very convenient, as it makes the search and collection of desired data from enormous datasets both effective and efficient. In most forensic investigations time is one of the most critical factors, and Hadoop takes this factor well into account. Hadoop requires high performance computing based on highly sophisticated, powerful supercomputers or a grid computing facility. Therefore this technique is applicable to organizations such as the military that have a major target to achieve as a result of the investigation and have sufficient computing resources. Hadoop can also be used to analyze log files and web clicks. Though it provides high-end solutions, it is not based on databases. Further, Hadoop isn’t effective when used in a real-time environment or for jobs that have a short duration. The toolset can only be used for static data that isn’t changing at runtime. Perhaps the greatest concern while using Hadoop is that it requires skilled technical developers for coding and debugging.

B. Phylogenetic Trees/Graphs

Data visualization refers to the use of tools and techniques that help in the understanding and interpretation of closely related data. Often data is so large that proper visualization is heavily dependent on a data structure that is both algorithmically efficient and aesthetically pleasing. Tree-based data structures have been used in the analysis of big data. In [4] we have already presented a technique that helps in data visualization, and in this paper we intend to apply it to forensic investigations. Our technique is based on the generation of a phylogenetic tree [14]. Decision trees help the investigator reach a conclusion based on a defined rule base, whereas the phylogenetic tree generated in our scheme is a circular tree that shows the interdependency of data. Phylogenetic trees, which are also termed graphs, are very efficient in the depiction of links between data. The only downside of using trees is that they require a connection to an appropriate relational database application, built by preprocessing the big data, in order to express the close relationships among the data.

After getting hold of enormous data, an investigator can run automated SQL scripts to preprocess the data and hence build a relational database. Since large datasets are involved, visualization will assist the investigator in analyzing the data easily. Figure 2 shows the flow of events to generate a graph/phylogenetic tree that can assist in the visualization of big data. The proposed methodology uses the “GraphViz” Java library [13] to generate a DOT file of the extracted data [16]. Once a DOT file is generated it is parsed so that a graph can be created.

Fig. 2. Flow of events for graph/phylogenetic tree generation
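As a rough illustration of the DOT-generation step in Section III-B, the sketch below builds a small co-author graph and serializes it as DOT text. It is written in plain Python rather than with the GraphViz Java library the paper uses, so it should be read as an assumption-laden stand-in, not the authors' implementation. The record layout, the names and the publication counts are invented, the rule that scales node width with publication count is assumed, and `twopi` (Graphviz's radial layout engine) is chosen here only because the paper describes a circular tree.

```python
# Hypothetical records, e.g. rows pulled from a preprocessed relational table.
# All names and counts are made up for illustration.
AUTHOR = ("Author A", 12)
COAUTHORS = [
    # (name, total publications, publications shared with AUTHOR)
    ("Coauthor B", 8, 3),
    ("Coauthor C", 5, 1),
]


def quote(text):
    """Quote a string as a DOT identifier, escaping backslashes and quotes."""
    return '"' + text.replace('\\', '\\\\').replace('"', '\\"') + '"'


def node_line(name, pubs):
    """Emit one DOT node statement with a label and a width attribute."""
    # Assumed sizing rule: node width grows with the publication count.
    width = 0.5 + 0.1 * pubs
    label = quote(f"{name} ({pubs})")
    return f"  {quote(name)} [label={label}, width={width:.1f}];"


def to_dot(author, coauthors):
    """Serialize an author and the co-authors as an undirected DOT graph."""
    name, pubs = author
    # layout=twopi requests a radial drawing (an assumption, see lead-in).
    lines = ["graph coauthors {", "  layout=twopi;", node_line(name, pubs)]
    for co_name, co_pubs, shared in coauthors:
        lines.append(node_line(co_name, co_pubs))
        # The edge label carries the number of common publications.
        edge_label = quote(str(shared))
        lines.append(f"  {quote(name)} -- {quote(co_name)} [label={edge_label}];")
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(AUTHOR, COAUTHORS)
print(dot_text)
```

The resulting text can be saved to a file and rendered with the Graphviz command-line tools. Keeping the serialization in one small function also makes it easy to hash the DOT file after generation, which matters because the graph is only as trustworthy as the file it is parsed from.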
For graph generation the integrity of the DOT file is very important because the DOT file, once parsed, forms the graph. The following data has been taken from a DOT file generated for “Hermann Maurer”. The DOT file specifies the nodes of the phylogenetic tree along with the layout, width, height and label. Based on these values the tree is generated.

[DOT file excerpt not preserved in this copy]

Though tree visualization looks aesthetically pleasing and helps the forensic investigator, it requires highly trained professionals who can code on sophisticated computer systems capable of running the visualization application on large datasets. This technique also requires prior analysis of the data to identify appropriate attributes that can link different datasets together based on their relationships. Hence this visualization technique can be deployed in well organized laboratory settings. Figure 3 represents the graph generated from the above mentioned DOT file. This graph demonstrates the relationships among the different datasets and shows the work of an author with other coauthors. This particular graph shows the total number of publications of an author, the names of coauthors, the number of publications of each coauthor and their common publications. The node size also changes according to the data. Hence this helps the investigator visualize the data efficiently and effectively. As Figure 3 suggests, this technique is applicable not only to forensic investigations but to any field, ranging from online datasets to health care systems. It can give a pictorial representation of any available dataset, and as a result the forensic investigator can easily reach a conclusion based on the graph.

Fig. 3. Graph/Phylogenetic Tree Generation

C. Blind Source Separation (BSS)

Audio forensics is also important in an investigation, as sometimes an audio file may be recovered from a crime scene. This audio file may serve as evidence in the forensic investigation. The audio file can contain background sounds that are generated by different sources and may be unimportant. This may cause problems for the forensic investigator, so the extraction of meaningful audio from the background sound poses many challenges. To deal with this problem a technique called Blind Source Separation (BSS) is used. BSS is defined as the separation of a set of source signals from a set of mixed signals, without the aid of information (or with very little information) about the source signals or the mixing process [21]. Until now BSS has been used successfully in signal processing, mainly in speech recognition systems, telecommunication and medical signal processing. In this paper we aim to extend the use of BSS to digital forensics so that it can help the forensic investigator. BSS separates a collection of signals and identifies them as individual signals. The identification and separation of the sound from background noise is done on a mathematical basis. Apart from digital forensic investigations, Blind Source Separation has a wide range of applications [18]:

• Audio
• Speech
• Music Industry
• Image
• Video
• Biomedical
• Communication signal processing
• Forensic analysis etc.

In general a recorded sound comprises meaningful sounds and meaningless noise or interference. Hence the aim here is to separate meaningful sound from noise with the help of a filter. Noise has a distortive effect and therefore masks the underlying signal that is to be analyzed. Noise on its own adds no information and is of no importance to the forensic investigator, so it needs to be eliminated. The filter acts as a data reduction step. Once the data has been filtered, the sound has to be reconstructed using interpolation techniques. The success of the filtering, i.e. the data reduction and reconstruction, depends entirely on the nature of the noise and of the signal that needs to be separated from it. This complete process can be explained with the help of a model called the “Blind Source Separation (BSS)” model. Figure 4 pictorially represents the BSS model. Once the signal is reconstructed it becomes clear and easily conveys its purpose and meaning to the forensic investigator.
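For reference, the separation problem described in Section III-C is commonly written as a linear mixing model. The notation below is the standard instantaneous formulation and is an assumption here, since the paper does not state its model explicitly and its reference [21] concerns the harder convolutive case: s(t) denotes the unknown source signals, x(t) the recorded mixtures, n(t) additive noise, A the unknown mixing matrix and W the unmixing matrix to be estimated.

```latex
\begin{align}
  \mathbf{x}(t) &= \mathbf{A}\,\mathbf{s}(t) + \mathbf{n}(t), \\
  \hat{\mathbf{s}}(t) &= \mathbf{W}\,\mathbf{x}(t),
  \qquad \mathbf{W}\mathbf{A} \approx \mathbf{P}\,\mathbf{D}.
\end{align}
```

Here P is a permutation matrix and D a diagonal scaling matrix: because neither A nor s(t) is known, the sources can be recovered only up to their order and amplitude, which is why the reconstructed signals still have to be identified and interpreted by the investigator.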


Fig. 4. Blind Source Separation (BSS) Model

First the audio file/image is acquired from the crime scene. Once the audio file is taken into custody, digitization is performed; this crucial step of audio analysis requires special hardware and software. Audio frequency analysis, amplitude adjustments and Fourier transforms are some of the basic techniques that define the starting point of this operation [19]. BSS then separates the different sound sources. The waves are identified based on the features of the sounds, i.e. a Gaussian formula. With the help of classification tools the sounds are reconstructed. Once the sounds have been reconstructed they are presented to the forensic investigator so that they can be used to perform a closer analysis of the meaningful sound extracted from the sound image. Figure 5 shows the signal processing module that helps in the separation and reconstruction of the signals.

BSS requires sophisticated hardware and software to distinguish between the signals originating from different sources, along with expert professionals to carry out the entire source separation process. This technique was previously used mainly by the music industry and healthcare organizations. We now intend to introduce it for performing digital forensic investigations.

Fig. 5. Signal Processing Module

D. Image Culling

Image forensics requires an automated technique to separate a suspect image from a dataset of several hundred images. Just as big data refers to large datasets obtained from different sources, the same is true of images. Nowadays images taken into custody from a crime scene come in different formats such as TIFF, JPEG, GIF, PNG and raw image files. Some of these images may serve as evidence and may help the investigator carry out the investigation. These images can be gathered from large sources, so their extraction presents many challenges to forensic investigators. There are several culling techniques, including indexing, de-duplication, file extension inclusion/exclusion, keyword identification and date range searches [20]. Using these techniques the resulting dataset of suspect images can be narrowed down.

The investigation starts with the acquisition of the datasets, and the original images are preserved to maintain the integrity of the primary dataset. This step also requires a chain of custody form to be maintained. The next step, culling, is the extraction of appropriate and meaningful images from the large datasets. Once the images have been extracted the data is processed so that tampering can be detected based on rasterization and the required investigations can be performed as desired. Once the images have been processed they are reviewed by expert forensic investigators. Figure 6 depicts the image culling lifecycle.

Fig. 6. Image Culling Lifecycle

This technique requires hardware that provides high-end processing speed along with high resolution image viewing capability. The high-end processing helps render dynamic scenes with high depth complexity. Rasterization also requires dedicated resources to convert vector graphics into raster images. Image culling is to be performed by highly professional and expert forensic investigators. Hence this technique can only be used in an image processing laboratory that has dedicated hardware and software resources to carry out the investigation.
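Three of the culling techniques listed in Section III-D (file extension inclusion, date range search and de-duplication) can be sketched as successive passes over an evidence listing. The snippet below is only a minimal illustration under stated assumptions: the `cull` function, the in-memory tuple layout, the sample file names and the extension list (including `.raw` as a stand-in for raw image formats) are hypothetical, and SHA-256 content hashing is one common way to detect exact duplicates rather than the method prescribed by [20].

```python
import hashlib
from datetime import date
from pathlib import PurePath

# Assumed inclusion list, based on the formats named in the text.
IMAGE_EXTENSIONS = {".tiff", ".tif", ".jpeg", ".jpg", ".gif", ".png", ".raw"}


def cull(items, start, end, extensions=IMAGE_EXTENSIONS):
    """Return the names of items that survive three culling passes.

    items: iterable of (filename, modified_date, content_bytes) tuples.
    """
    seen_digests = set()
    kept = []
    for name, modified, content in items:
        # Pass 1: file extension inclusion (case-insensitive).
        if PurePath(name).suffix.lower() not in extensions:
            continue
        # Pass 2: date range search, inclusive on both ends.
        if not (start <= modified <= end):
            continue
        # Pass 3: de-duplication on a content hash; the first copy is kept.
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        kept.append(name)
    return kept


# Hypothetical evidence listing; the preserved originals are only read.
evidence = [
    ("IMG_001.JPG", date(2015, 3, 1), b"pixels-a"),
    ("copy_of_001.jpg", date(2015, 3, 2), b"pixels-a"),  # duplicate content
    ("notes.txt", date(2015, 3, 2), b"not an image"),    # excluded extension
    ("scan.png", date(2014, 1, 5), b"pixels-b"),         # outside date range
    ("cam.tiff", date(2015, 4, 9), b"pixels-c"),
]
shortlist = cull(evidence, date(2015, 1, 1), date(2015, 12, 31))
print(shortlist)  # ['IMG_001.JPG', 'cam.tiff']
```

The function only reads its input and returns a shortlist, which mirrors the requirement that the primary dataset be preserved untouched while the culled subset is passed on for processing and review.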


IV. FACTORS AFFECTING FORENSIC TECHNIQUES

In this section we extend our analysis of the techniques proposed above for big data forensic investigation. Different factors can influence the selection and performance of the forensic techniques. These factors should be analyzed closely before carrying out the investigation. Table 1 lists the factors that can affect the big data forensic techniques and gives the severity of each factor for the respective technique. This table has been populated based on interviews and surveys with forensic professionals. The selection of the forensic technique is based on the available resources (cost, time, human) and the sensitivity of the data to be used for investigation.

TABLE I. FACTORS AFFECTING BIG DATA FORENSIC TECHNIQUES

CONCLUSION

The term “Big Data” is not new; only the tools used in big data are new and emerging. Big data has posed many challenges for forensic investigators. In this paper some techniques that were previously used in a different context have been analyzed for their ability to help the forensic investigator perform big data forensics. Big data forensic analysis is a comprehensive sequence of activities that is heavily dependent on full identification of data sources. Modern data resources must include data related to sensors, network profiles, sound files, image files and intrusion detection. For the analysis of big data residing on the internet, the MapReduce framework is used to narrow down the large datasets. The MapReduce framework is governed by Hadoop, which is a specialized DFS for large datasets. Such specialized tools are needed for working with datasets specific to big data. In this paper we discuss the advantages of data visualization through the use of phylogenetic trees, owing to their ability to provide aesthetically pleasing data visualizations.

Since the sources of data are complex and diverse in nature, we discuss the effect of noise on audio data. To remove the impact of noise, the use of the Blind Source Separation technique for audio forensic analysis has been discussed. Images suffer from similar problems and require specialized tools for image processing. In this paper we also discuss the use of image culling as a technique that separates evidence images from a large dataset of images. By discussing these techniques we have shed light on the evolving challenges that have arisen with big data and on how forensic analysis of big data can be handled. We have also identified the factors that should be taken into consideration during the selection of a forensic technique.

REFERENCES

[1] AICPA, “The 2014 AICPA Survey on International Trends in Forensic and Valuation Services”, Forensic and Valuation Services Section, American Institute of CPAs, New York, 2014.
[2] B. Nelson, A. Phillips, C. Steuart, “Guide to Computer Forensics and Investigation”, 4th Edition, 2010.
[3] M. Afzal, A. Latif, A. Saeed, P. Strumm, S. Aslam, K. Andrews, K. Tochtermann, H. Maurer, “Discovery and Visualization of Expertise in a Scientific Community,” Proceedings of the 7th International Conference on Frontiers of Information Technology, 2009.
[4] S. Tahir, M. T. Afzal, “A Novel Phylogenetic Tree Data Visualization Application for Researchers”, Science and Information Conference 2014, London, pp. 93-99, 27-29 August, 2014.
[5] D. Borthakur, “The Hadoop Distributed File System: Architecture and Design”, The Apache Software Foundation, 2007.
[6] D. Schobben, K. Torkkola, P. Smaragdis, “Evaluation of Blind Signal Separation Methods”, http://citeseerx.ist.psu.edu/, 1999.
[7] P. Breuer, L. Forlna, J. Moulton, “Beyond the hype: Capturing value from big data and advanced analytics”, Perspectives on retail and consumer goods, Springer 2013.
[8] R. J. Miller, “Big Data Curation,” DIMACS Big Data Integration Workshop, 2013.
[9] D. Sindol, “Big Data Basics – Part 1 – Introduction to Big Data,” MSSQLTips, 2013.
[10] K. Shvachko, H. Kuang, S. Radia, R. Chansler, “The Hadoop Distributed File System,” 978-1-4244-7153-9, IEEE, 2010.
[11] S. C. Mouliswaran, S. Sathyan, “Study on Replica Management and High Availability in Hadoop Distributed File System (HDFS),” Journal of Science, Information Technology, Vol. 2, Issue 2, pp. 65-70, 2012.
[12] D. Borthakur, “HDFS Architecture Guide,” The Apache Software Foundation, 2008.
[13] Graphviz - Graph Visualization Software, http://www.graphviz.org.
[14] B. Reddy, “Basics for the Construction of Phylogenetic Trees,” Webmed Central BIOLOGY 2011, Vol. 2, no. 12, Dec 2011.
[15] I. Tollis, “Graph Drawing and Information Visualization,” ACM Computing Surveys, Vol. 28A, no. 4, December 1996.
[16] E. R. Gansner, E. Koutsofios, S. North, “Drawing graphs with dot”, dot User's Manual, 2 November 2010.
[17] A. Guarino, “Digital Forensic as a Big Data Challenge,” StudioAG, 2013.
[18] J. Han, Z. Rafii, B. Pardo, “Audio Source Separation,” Interactive Audio Lab, http://music.cs.northwestern.edu/.
[19] Y. Wang, Z. Zhou, “Source Extraction in Audio Via Background Learning,” Inverse Problem Image, 2011.
[20] “Big Data Culling,” Cost effective data minimization, Digital Discovery, http://www.digitaldiscoveryes.com.
[21] R. Acharyya, F. M. Ham, “A New Approach for Blind Separation of Convolutive Mixtures”, International Joint Conference on Neural Networks, IJCNN, Orlando, FL, 2007.
