
cloudera contributes 30-40% of hadoop development

solr - for searching indexed data, like amazon shopping


usecases of spark streaming in telecom industry
drools - rule based engine
biggest graph user is facebook - social networking data is in the form of a graph.
collaborative filtering - Recommendation by ecommerce
mahout is a project which takes ML algorithms and implements them as map reduce jobs.
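
A minimal sketch of the collaborative filtering idea noted above, in plain Python (no mahout or map reduce). The ratings, user names, and the cosine/recommend helpers are made up for illustration; real recommenders use far more data and more careful weighting.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"book": 5, "phone": 3, "lamp": 4},
    "bob": {"book": 5, "phone": 3, "tent": 5},
    "carol": {"phone": 5, "lamp": 1, "tent": 2},
}

def cosine(a, b):
    """Cosine similarity between two users' rating dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user):
    """Rank items the user has not rated, weighted by similar users' ratings."""
    scores = {}
    for other, items in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], items)
        for item, rating in items.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))
```

The only item alice has not rated is the tent, so that is what gets recommended to her.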

file formats
sequence - key value pair
avro - schema based, used in flume, best for RPC calls

hdfs dfs -cat /loudacre/kb/KBDOC-00289.html | head -n 20
refer to http://free.primarypad.com/p/devsh_Jan10_bt_ss for extra blogs and notes (accessible on wifi)

IMP - Partitioning in Spark Jobs


The number of partitions depends on 1) the size of the input data and 2) any explicitly specified number of partitions.
Task stages remain the same; the number of tasks may differ based on the language.
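
A rough sketch of the two factors above, in plain Python rather than Spark. The helper name estimate_partitions and the 128 MB block size are assumptions for illustration; in a real job the actual count can be checked with rdd.getNumPartitions(), and the value passed to sc.textFile(path, minPartitions) is only a minimum hint, not an exact count.

```python
import math

def estimate_partitions(input_bytes, block_bytes=128 * 1024 * 1024, explicit=None):
    """Hypothetical helper: rough partition count for a file-based input.

    Simplification: an explicit number, if given, wins outright;
    otherwise assume one partition per block of input data.
    """
    if explicit is not None:
        return explicit
    return max(1, math.ceil(input_bytes / block_bytes))

mb = 1024 * 1024
print(estimate_partitions(300 * mb))              # driven by input size
print(estimate_partitions(300 * mb, explicit=8))  # driven by explicit number
```

Each stage runs one task per partition, so changing the partition count changes the number of tasks without changing the stages.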
python lib for serialization - pickle
kryo - java + scala
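
A minimal pickle round trip, to show what serialization means here: an object is turned into bytes and rebuilt from them. The record contents are made up for illustration.

```python
import pickle

# Hypothetical record, loosely modelled on the KB document above
record = {"kb_id": "KBDOC-00289", "lines": 20}

payload = pickle.dumps(record)    # object -> bytes
restored = pickle.loads(payload)  # bytes -> equal copy of the object

print(type(payload).__name__, restored == record)
```

Only unpickle data from a trusted source, since pickle.loads can run arbitrary code.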

flume - serializers
kafka channel is highly reliable - 0 loss of events

Caused by: py4j.Py4JException: Cannot obtain a new communication channel


Please go to training.cloudera.com and complete the following survey.
You need your username and password and probably the following info:
Class ID: 6297
Key: xUh5pHew
