Data Mining
Tools
Hadoop
Hadoop is a framework for Map Reduce. It is an open source top level Apache project. This framework is good for huge size of data, complex data, complex analysis or high velocity of data which need to be processed near real time.
Map Reduce is based on functional programming model. The Hadoop ecosystem consists of
- Core: Hadoop Map Reduce and Hadoop Distributed File System (HDFS)
- Data: HBase (distributed database), Hive (data summarisation/ad hoc querying), Pig (data flow language)
- Algorithm: Mahout (machine learning and data mining library)
- Data In: Flume (manage log data), sqoop (bulk data transfer), Nutch (web search), Storm (realtime computation system)Also, there are Avro, Cassandra, Chukwa, ZooKeeper and MongoDB
- The setup and maintenance is quite complex for a small company, Cloudera provides software and support. There are also ready to use cloud based implementation such as Amazon EC and Treasure Data. The commercial version of Map Reduce framework is also available from Oracle Exadata, EMC Greenplum, IBM Pure Data, HP Vertica and Teradata Aster.
SAS is also supporting Hadoop, by providing connector between SAS and Hadoop.