Archives For Big Data


Understanding the DNS of Data Science


Data Science is the competitive advantage of the future for organizations interested in turning their data into a product through analytics. Industries from health, to national security, to finance, to energy can be improved by creating better data analytics through Data Science.

The Field Guide to Data Science

Download the full book

 


Big Data Tools for Data Developers and Data Analysts
Source: http://blog.profitbricks.com/top-45-big-data-tools-for-developers/

 

1. Splice Machine

@SpliceMachine

A real-time SQL-on-Hadoop database, Splice Machine takes Big Data beyond analytics with the ability to derive real-time, actionable insights for rapid decision-making. Not only can Splice Machine process real-time updates, but it offers the ability to utilize standard SQL and is capable of scaling out on commodity hardware. Splice Machine can be used in circumstances where MySQL or Oracle can’t scale.

Key Features: 

  • SQL-99 compliant, with standard ANSI SQL
  • Easily scales from gigabytes to petabytes using cost-effective, commodity hardware
  • Real-time updates with transactional integrity
  • Distributed computing architecture
  • Multiversion Concurrency Control (MVCC)

Cost: Contact for a quote
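
To make the MVCC bullet above more concrete, here is a minimal sketch of the general idea in Python: writers append new versions instead of overwriting, and each reader sees the data as of its own snapshot. This is a toy illustration only, not Splice Machine's implementation; the `MVCCStore` class and its method names are hypothetical.

```python
import itertools


class MVCCStore:
    """Toy multiversion store: writers append versions, readers use a snapshot."""

    def __init__(self):
        self._versions = {}               # key -> list of (commit_ts, value)
        self._clock = itertools.count(1)  # monotonically increasing timestamps
        self._last_commit = 0

    def write(self, key, value):
        """Commit a new version of `key`; older versions stay readable."""
        ts = next(self._clock)
        self._versions.setdefault(key, []).append((ts, value))
        self._last_commit = ts
        return ts

    def snapshot(self):
        """Return the timestamp a new reader should use for all of its reads."""
        return self._last_commit

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before `snapshot_ts`."""
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None


store = MVCCStore()
store.write("balance", 100)
reader_ts = store.snapshot()   # a reader starts here
store.write("balance", 250)    # an update lands after the reader's snapshot

print(store.read("balance", reader_ts))         # 100
print(store.read("balance", store.snapshot()))  # 250
```

The reader that started before the second write keeps seeing 100 without blocking the writer, which is the property that lets real-time updates and consistent reads coexist.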

2. Palantir

@PalantirTech
Palantir was founded in 2004 by a group of former PayPal employees and Stanford computer scientists. The company has doubled in size every year to date, but strives to maintain its startup culture. Offering a suite of Big Data solutions for integrating, visualizing and analyzing information, Palantir’s product line emphasizes scalability, security, ease of use, and collaboration. Palantir’s solutions are most commonly used in intelligence, defense, financial and law enforcement applications, but the company is quickly growing in other verticals.

Key Features: 

  • Solutions for integrating, visualizing and analyzing data
  • Serves a multitude of industries with custom solutions
  • Exploit and analyze data
  • Extract data from multiple sources
  • Privacy and data protection policies
  • Simplify workflows by integrating data into a single dashboard

Cost: Contact for a quote

3. Attivio

@Attivio
As enterprises are coping with a broader variety of information sources, eliminating information silos is critical to gaining comprehensive insights and identifying key relationships among data. Attivio’s Active Intelligence Engine combines Big Data and Big Content to analyze everything, including human-generated text through advanced text analytics. Combined with universal indexing and automatic ad-hoc JOIN, Attivio is a powerful solution for making valuable connections between all your data.

Key Features: 

  • Combines Big Data and Big Content
  • Eliminates information silos
  • Adds context and signals from human-generated information sources
  • Supports BI/data visualization tools
  • In-engine analytics

Cost: Contact for a quote

4. Google Charts

@googledevs
Google Charts is a free tool with a wide range of capabilities for visualizing data from a website. From simple charts to complex hierarchical tree maps, Google Charts offers a gallery with a multitude of pre-configured chart types to choose from. Google Charts is easily implemented by embedding simple JavaScript code on a website, yet it offers complex functionality with the ability to connect to dashboards, sort, modify, and filter data, connect to a database or integrate with and pull data from a website. Take it a step further by implementing the Chart Tools Datasource protocol and allow other entities to source data from your website.

Key Features: 

  • Charts exposed as JavaScript classes
  • Customize to match the look and feel of a website
  • Charts populated using DataTable class
  • Sort, modify, filter data
  • Populate data from a variety of sources

Cost: FREE
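
Since charts are populated through the DataTable class, a server that feeds them usually has to emit data in the DataTable JSON layout (a `cols` list describing the columns and a `rows` list of cells). The sketch below builds that structure in Python. The helper name `to_datatable` and the sample page-visit figures are made up for illustration; check the field names against the Google Charts documentation before relying on them.

```python
import json


def to_datatable(columns, rows):
    """Build a dict in the DataTable JSON literal layout.

    columns: list of (label, type) pairs, e.g. ("Visits", "number")
    rows:    list of row tuples, one value per column
    """
    return {
        "cols": [
            {"id": f"col{i}", "label": label, "type": col_type}
            for i, (label, col_type) in enumerate(columns)
        ],
        # Each row is a list of cells under "c"; each cell holds its value under "v".
        "rows": [{"c": [{"v": value} for value in row]} for row in rows],
    }


# Hypothetical sample data
table = to_datatable(
    [("Page", "string"), ("Visits", "number")],
    [("/home", 1200), ("/pricing", 340)],
)
print(json.dumps(table, indent=2))
```

On the page, the resulting JSON would typically be handed to the DataTable constructor and then drawn with one of the chart classes.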

5. Mortar

@Mortardata

Mortar is a “general purpose platform for high-scale data science” designed to help data scientists spend more time analyzing their data and deriving actionable insights, instead of dedicating valuable time to building infrastructure and re-configuring systems. With Mortar, you can build a custom recommendation engine in days, not months.

Key Features: 

  • Open-source tools for building a recommendation engine
  • Built on Hadoop and Apache Pig
  • Create, test and run jobs from in-browser IDE
  • Snapshots monitor changes and progress
  • Instant feedback on code for fewer bugs

Cost: 

  • Public – FREE
  • Solo – $99/month
  • Team – $499/month

6. SAP

@SAPInMemory

SAP’s HANA platform can be combined with Apache Hadoop for the ability to integrate and analyze massive loads of data in real time. The platform makes it possible to derive actionable insights by making valuable connections between all types of information, from a multitude of sources. Combine SAP HANA with applications that leverage Big Data insights to quickly create additional revenue streams and improve operations.

Key Features: 

  • Infinite storage
  • Flexible data management for all types of data
  • Discover insights with analytics solutions
  • Runs processes 1,000 to 100,000 times faster in-memory
  • SAP IQ analytics holds the Guinness World Record for data loading

Cost: Contact for a quote

7.  Cambridge Semantics

@CamSemantics

Collecting, integrating, and analyzing Big Data doesn’t have to be a major effort. Cambridge Semantics makes it all possible with the Anzo Software Suite, an open platform for building Unified Information Access (UIA) solutions. That means replacing the information silos that leave data isolated and useless with a powerful, seamless data integration machine that streamlines data collection and enables sophisticated analysis for rapid decision making. And you can implement all this within hours or days — not the typical weeks or months required for an initiative at this level.

Key Features: 

  • Combine data from a multitude of sources
  • Customized, interactive web dashboards for analysis
  • Share spreadsheets in sync automatically
  • Useful for CRM, billing, project management and more

Cost: Contact for a quote

8. Fusion Charts

@FusionCharts

Just because development is in your blood doesn’t mean you should embrace complexity when simpler solutions are readily available. Fusion Charts enables you to create sophisticated, cross-device compatible JavaScript charts with animation, rich interactivity, and impressive design with ease. Don’t spend your valuable time on child’s play; with Fusion Charts, you can spend more time on complex development tasks and deliver even better results.

Key Features: 

  • Developer resources center
  • Interactive zooming and scrolling
  • Real-time charts and gauges
  • Multi-lingual charts
  • Visually editable charts and gauges
  • Linked charts and a variety of effects

Cost: Contact for a quote

9. MarkLogic

@MarkLogic

MarkLogic is built to support the world’s biggest data loads, bringing all types of relevant content back to users who can turn it into action. With real-time updates and alerts, connections between information make new opportunities immediately obvious. MarkLogic is ideal for enterprises that count on revenue through paid content search. With geographic data combined with content, location relevance is built in, and geographic boundaries make advanced data filtering possible.

Key Features: 

  • Range of Big Data solutions
  • Speeds development
  • Flexible APIs
  • NoSQL
  • Real-time analysis and updates
  • Bring all types of content back to end users

Cost: Contact for a quote

10. Syncsort

@Syncsort

Syncsort offers a range of products and solutions to help you tap into Big Data. With products for Hadoop, Linux, Unix, Windows, and mainframe, Syncsort’s lineup offers a solution to meet practically any configuration need. A GUI-based solution, Syncsort enables developers to create solutions for collecting, processing, and distributing more data in less time.

Key Features: 

  • Solutions for Hadoop, Mainframe, Windows, Linux, Unix
  • Lowers the barriers to Hadoop adoption
  • Eliminates the need for custom code for Hadoop implementation
  • High-performance sorting
  • Improve efficiency

Cost: Contact for a quote

11. DataStax

@DataStax

DataStax helps companies like Netflix, Healthcare Anytime, eBay, and even Adobe harness the power of Big Data with less effort and at a lower cost than traditional solutions. Tapped as the first alternative to Oracle, DataStax provides the constant uptime and lightning speed required for modern customer-facing applications. When you need the capacity to handle massive data loads at maximum speed for real-time analysis, DataStax packs a major punch with a robust visual query tool for developers.

Key Features: 

  • Visual query tool for developers
  • Create and run Cassandra Query Language (CQL) queries and commands
  • Visually navigate and interact with data clusters
  • Works with DataStax Community and Enterprise editions

Cost: Contact for a quote

12. Guavus

@Guavus

Need to create a rich, engaging and meaningful customer experience? Guavus drives better decision making with powerful analytics capabilities combined with advanced data science and the ability to distill data in real time to derive actionable insights at the precise moment of opportunity. Through continuous correlation and ongoing analysis, vast amounts of static and dynamic data are handled with ease, revealing opportunities to generate more revenue, reduce overhead costs, and monetize new streams.

Key Features: 

  • Analyze-First Analytics Architecture
  • Analyze high-volume data streams in near real time
  • Handles multiple data sources with ease
  • Continual data analysis from moment of capture

Cost: Contact for a quote

13. mongoDB

@MongoDB

An open-source document database, mongoDB is ideal for developers who want precise control over the final results and processes for handling Big Data. With full index support, you have the flexibility to index any attribute and scale horizontally without compromising functionality. With rich, document-based queries and GridFS for storing files of any size without the risk of compromising your stack, mongoDB is a scalable, flexible, and powerful solution for Big Data.

Key Features: 

  • Open-source platform
  • Document-oriented storage
  • Flexible aggregation and data processing
  • Full index support
  • GridFS

Cost: FREE

14. Infochimps Cloud

@Infochimps

A cloud service solution for Big Data, Infochimps Cloud makes it possible to deploy Big Data applications rapidly and without the typical time commitment. For applications requiring real-time analysis, multi-source streaming data, a NoSQL database, or a Hadoop cluster, Infochimps Cloud offers a solution that facilitates rapid implementation. Real-time analytics, ad hoc analytics, and batch analytics comprise Infochimps Cloud’s three essential cloud services.

Key Features: 

  • Integrate with any data source – CRM solutions, etc.
  • Log analysis
  • Mobile data analytics
  • Fraud detection and risk analysis
  • Ad targeting
  • Customer insights via social media sources, website clickstreams and more

Cost: Contact for a quote

15. Pentaho

@Pentaho

Pentaho brings IT and business users together by joining data integration and business analytics for integrating, visualizing, analyzing and blending Big Data in ways never before possible for better business results. When you need the ability to put robust information at your users’ fingertips in real time and at a reasonable cost, Pentaho’s open, embeddable and extensible analytics platform makes it easy to visualize, explore, and predict — turning data into value.

Key Features: 

  • High-volume data processing
  • Adaptive Big Data layer
  • Data mining and predictive analysis
  • Instaview – data to insights in 3 steps

Cost: Contact for a quote

16. Karmasphere

@Karmasphere

Designed for teams of analysts who need to explore and analyze Big Data on Hadoop, Karmasphere is a key solution for self-service analytics on Hadoop. With companies seeking more effective and efficient ways to make use of Big Data, users can turn data into real business value by discovering insights and influencing outcomes.

Key Features: 

  • Organized dashboard
  • SQL data explorer
  • 250-plus pre-packaged Hadoop algorithms
  • SAS, SPSS and R Analytic Models
  • Dynamic data lenses for self-service analytics

Cost: Contact for a quote

17. Placed

@Placed

Placed facilitates data collection from offline sources, enabling enterprises to derive actionable insights through a combined analysis of both offline and online behavior and data metrics. Placed Targeting and Placed Attribution facilitate better results from mobile advertising by using Big Data capabilities to map the relationship between people and places.

Key Features: 

  • Measure visitation trends over time
  • Measures 100 million locations a day, across more than 100,000 opted-in US smartphones
  • Inference Pipeline references a place database with nearly 300 million features for the US alone
  • Largest repository of offline insights into the paths and behaviors of consumers
  • Audience segmentation by demographics and other data points
  • Affinity modeling for understanding relationships between data
  • Monitor and understand how consumer behavior changes over time

Cost: Contact for a quote

18. Upsight

@GetUpsight

Upsight, formerly Kontagent, provides analytics that help developers understand what’s happening with their apps and derive actionable insights from data to impact acquisition, engagement, retention and revenue. The platform also enables the creation of targeted in-app and out-of-app metrics in line with KPIs.

Key Features: 

  • Free, enterprise-grade analytics
  • Unlimited data storage
  • Data mining with Hadoop
  • Measure anything from social apps to games and mobile dating apps
  • Funnel analysis
  • Cohort explorer
  • Predictive LTV

Cost:

  • FREE – Analytics and unlimited data storage, 250k push
  • Core – $500/month - Custom Events up to 100k MAU, 500k push
  • Pro – $2,000/month - Custom Events up to 250K MAU, 1M push
  • Enterprise – Starting at $3,000/month - Unlimited Data Storage & Custom Events + Data Mine + Predictive LTV + A/B, unlimited push
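
For readers unfamiliar with the funnel analysis listed among the features, the sketch below shows the basic calculation in Python: count how many users complete each step of an ordered sequence of events. It is a generic illustration, not Upsight's API; the `funnel` function, the event names, and the sample users are all hypothetical.

```python
from collections import defaultdict


def funnel(events, steps):
    """Count users who reached each funnel step, in order.

    events: iterable of (user_id, event_name) in chronological order
    steps:  ordered list of event names that define the funnel
    """
    progress = defaultdict(int)  # user_id -> number of steps completed so far
    for user_id, name in events:
        done = progress[user_id]
        # Only the next expected step advances a user through the funnel.
        if done < len(steps) and name == steps[done]:
            progress[user_id] = done + 1
    return [
        sum(1 for done in progress.values() if done > i)
        for i in range(len(steps))
    ]


# Hypothetical sample data
steps = ["install", "signup", "purchase"]
events = [
    ("u1", "install"), ("u2", "install"), ("u3", "install"),
    ("u1", "signup"), ("u2", "signup"),
    ("u3", "purchase"),  # skipped signup, so not counted at the purchase step
    ("u1", "purchase"),
]

counts = funnel(events, steps)
print(counts)  # [3, 2, 1]
```

Dividing each count by the first one gives the conversion rate at every step, which is the number a funnel report usually charts.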

19. Talend

@Talend

Talend Open Studio is “a powerful and versatile set of open source products for developing, testing, deploying and administrating data management and application integration projects.” Providing a unified environment for managing the full lifecycle, even across enterprise boundaries, Talend enables developers to reclaim their productivity with a fully integrated platform for joining data integration, data quality, MDM, application integration and big data.

Key Features: 

  • Data integration at a cluster scale
  • No need to write or maintain code
  • Works with leading Hadoop distributions
  • Pull source data from anywhere, including NoSQL

Cost: FREE

20. Jaspersoft

@Jaspersoft

Connect and visualize data for Hadoop Analytics, MongoDB Analytics, Cassandra Analytics, and other platforms in one central repository. Using Big Data, developers can configure reports, analytics, dashboards, and more, without having to migrate data to multiple databases.

Key Features: 

  • Real-time analytics
  • Integrate all your data
  • Blend data through an innovative data virtualization metadata layer or a traditional data warehouse using ETL
  • Present integrated visualizations and dashboards within your apps
  • Intuitive design tools let non-designers create visualizations

Cost: 

  • Free
  • Jaspersoft for AWS – Less than $1/hour

21. Keen IO

@Keen_io

With powerful APIs for gathering all the data you need and deriving the actionable insights that drive your business forward, Keen IO is a powerful, flexible, and scalable solution that is easy to implement and puts Big Data at your fingertips.

Key Features: 

  • Send as much data as you want, from any source
  • Set up event data on any action, such as upgrades, impressions or purchases
  • Arbitrary JSON format
  • Custom properties

Cost: 

  • Developer – FREE – 50,000 events/month
  • Startup – $20/month – 100,000 events/month
  • Growth – $125/month – 1M events/month
  • Premium – $300/month – 4M events/month
  • Professional – $600/month – 8M events/month
  • Business – $1,000/month – 15M events/month
  • Enterprise – $2,000/month – 50M events/month
  • Custom – Negotiable – 50M – 100B events/month
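
With this many tiers, it can help to work out which plan covers a given event volume. The sketch below encodes the price list above in Python and picks the cheapest tier. It assumes the event limits are inclusive and ignores any overage or negotiated pricing; the `cheapest_plan` helper is hypothetical.

```python
# (plan name, monthly price in USD, events included per month),
# copied from the price list above
PLANS = [
    ("Developer", 0, 50_000),
    ("Startup", 20, 100_000),
    ("Growth", 125, 1_000_000),
    ("Premium", 300, 4_000_000),
    ("Professional", 600, 8_000_000),
    ("Business", 1_000, 15_000_000),
    ("Enterprise", 2_000, 50_000_000),
]


def cheapest_plan(events_per_month):
    """Return (name, price) of the lowest tier covering the volume.

    Returns None when the volume exceeds every listed tier, which
    corresponds to the negotiable Custom plan.
    """
    for name, price, limit in PLANS:  # PLANS is ordered from cheapest to priciest
        if events_per_month <= limit:
            return name, price
    return None


print(cheapest_plan(75_000))      # ('Startup', 20)
print(cheapest_plan(750_000))     # ('Growth', 125)
print(cheapest_plan(60_000_000))  # None
```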

22. Skytree

@SkytreeHQ

Providing high-performance machine learning on Big Data for advanced analytics, Skytree offers the ideal platform for fully exploiting the opportunities presented by Big Data. With a multitude of industry-focused solutions as well as solutions encompassing everything from predictive analytics to algorithmic pricing, Skytree is a comprehensive Machine Learning platform emphasizing the growing importance of Predictive Analytics in Big Data.

Key Features:

  • Business Analytics range from value analytics to fraud detection and what-if analytics
  • Marketing Analytics offer solutions ranging from ad optimization to lead scoring and recommender systems
  • Only general purpose scalable Machine Learning system on the market
  • Highest accuracy on the market; unprecedented speed and scale
  • Power Packs modules are plugged into the Skytree Server Foundation

Cost: Contact for a quote

23. Tableau Software

@tableau

Tableau was launched by a computer scientist, an Academy Award-winning professor, and a business leader with a passion for data. This perfect trio created a powerful suite of solutions designed to put more data at users’ fingertips — and help them understand it in more meaningful ways. With an advanced query language for powerful visualizations, the ability to natively query databases, cubes, warehouses, and more, a lightning-fast in-memory analytics database designed to eliminate silos and more, Tableau addresses every corner of Big Data demands.

Key Features: 

  • In-memory analytics database eliminates memory silos
  • Leverages the complete memory hierarchy from disk to L1 cache
  • Tableau Public – free tool bringing data to life on the web
  • Touch, swipe and tap functionality for mobile
  • Easily layer in additional data sources
  • Access any data with a few clicks

Cost: 

  • Tableau Public – FREE
  • Tableau Desktop, Tableau Server, and Tableau Online – Contact for a quote

24. Splunk

@Splunk

Splunk harnesses all the machine data created by websites, applications, servers, networks, sensors, mobile devices, and other sources to monitor actions, activities and events, analyzing those data sources to derive actionable insights. Splunk is a self-contained software package downloadable and functional on any device.

Key Features: 

  • Derive insights from Big Data with speed and simplicity
  • Works on most major Hadoop distributions, including first-generation MapReduce and YARN
  • Splunk Hadoop Connect enables bi-directional integration
  • Real-time collection, indexing, and analyzing

Cost: 

  • Splunk Storm – FREE cloud service for developers
  • Splunk Enterprise – Perpetual License – Starts at $4,500 for 1 GB/day, plus annual support fees
  • Splunk Enterprise – Term License – Starts at $1,800 per year, which includes annual support fees
  • Hunk – One-year term license of Hunk starts at $2,500 per Hadoop TaskTracker or Compute Node with a minimum of ten TaskTrackers or Compute Nodes
  • Splunk Cloud – Annual subscription pricing, data volumes of 5GB/day to 1TB/day

25. Platfora

@Platfora

Platfora hides the complex nature of Hadoop, making it simpler for enterprises to discover and understand facts in their business across events, actions, behaviors and time. Built by Silicon Valley veterans who have built market-leading companies around big ideas, the Platfora team understands the power of Big Data and aims to change customers’ lives with Platfora as they’ve done with companies in the past.

Key Features: 

  • Vizboards for self-service, interactive data visualization
  • Analytics Engine, In-Memory Accelerator, and Hadoop Processor
  • Entity-centric data catalog
  • Build interest-driven pipelines of facts
  • Analyze data iteratively with segmentation
  • Collaboration features
  • On-premise or cloud deployment

Cost: Contact for a quote

26. Continuuity

@Continuuity

Continuuity enables developers to build Big Data applications quickly, easily, and seamlessly, deploying instantly on-premise or to the cloud. It’s all made possible through simple APIs that can be used with virtually any platform.

Key Features: 

  • User-implemented real-time stream processors (Flows)
  • Process a batch of data objects with the same transaction
  • More than one instance possible with each Flowlet
  • Programmatic control with REST interfaces
  • Three partitioning strategies to choose from
  • DataSets for higher-level abstractions

Cost:  Contact for a quote

27. BitDeli

@Bitdeli

BitDeli is an analytics tool for GitHub, enabling developers to gather data on who is viewing their repositories, where and when. With a one-click install, you can easily add analytics to your repositories and start gathering valuable data, including aggregate statistics across all forks for a given repository.

Key Features: 

  • One-click install
  • Automatically generated pull requests
  • Trending badge indicator shows repository popularity
  • Global rankings for comparison
  • Fork aggregation for a broad picture of project health

Cost: Based on GitHub Enterprise pricing

28. Flurry

@FlurryMobile

Flurry is an end-to-end solution for analyzing consumer behavior, advertising to the right audience, at the right time, and discovering new ways to monetize audiences. Flurry makes use of 3.5 billion app session reports per day totaling more than 3 terabytes to provide valuable insights for app developers, such as a deep understanding of the user base, engagement benchmarks, and other key metrics.

Key Features: 

  • Demographic estimations
  • App engagement benchmarks
  • App category and consumer interests
  • World’s largest app-audience data set
  • Reach more than 250 million customers per month
  • Data-powered targeting

Cost:

  • Flurry Analytics – FREE
  • Flurry AppCircle, FlurryPersonas, Flurry AppSpot – Contact for a quote

29. Spring Data / Pivotal

@springcentral

Spring Data is a set of projects designed to make it easier to use new data access technologies and provide improved support for relational database technologies. An umbrella project with many sub-projects designed for specific databases, Spring Data is an excellent source of Big Data tools to streamline the use of Big Data in the modern enterprise environment.

Key Features: 

  • Support for Hadoop, mongoDB, Data Rest and more
  • Also provides consulting services
  • Customized, all-in-one Eclipse-based distribution
  • Tool suites for ready-to-use solutions

Cost: FREE

30. Hortonworks

@hortonworks

Hortonworks offers a complete distribution for fully harnessing the power of Hadoop. With Hortonworks, you can process and analyze everything from Sentiment to Sensors with a 100% open-source, enterprise-grade distribution of Hadoop for every platform.

Key Features: 

  • Interact with all data, in multiple ways, simultaneously
  • Stable, tested, complete package of all services required for the platform
  • Integrates with other tools
  • HDP is built and supported by original architects, builders and operators of Hadoop

Cost: FREE

31. StatsMix

@cooperio

A complete dashboard for all your business needs, StatsMix dashboards are customized to your specific requirements, with the data you need from sources like Salesforce, MySQL, Google Analytics, and other tools and services.

Key Features: 

  • Chart and track anything
  • Measure KPIs
  • Track any metric with API
  • Automatic social monitoring
  • Share metrics and dashboards via email, embed them, or create guest accounts
  • Aggregate metrics to eliminate silos
  • Custom dashboards

Cost: 

  • Basic – $24/month – 100k API requests
  • Standard – $49/month – 300k API requests
  • Pro – $99/month – 1M API requests
  • Premium – $199/month – 3M API requests
  • Enterprise – $499/month – 8M API requests

32. Pervasive

@ActianCorp

Pervasive offers a number of Big Data tools, including several solutions for Hadoop and a free RushLoader for Hadoop. From DataFlow Analytics to ParAccel Dataflow ETL/DQ for designing end-to-end ETL and quality data workflows, Pervasive is a Big Data power suite.

Key Features: 

  • Partnership with Actian for powering Big Data 2.0
  • Predictive Analytics for Big Data
  • Simple interface for loading massive amounts of data at rapid speeds
  • Fastest data-crunching engine in the world

Cost: 

  • RushLoader for Hadoop – FREE
  • ParAccel Dataflow Loader for Hadoop – FREE for 12 months
  • All other products – Contact for a quote

33. InfiniDB

@InfiniDB

Real-time enablement of Big Data insights is yours with InfiniDB. A 100% open-source platform, you can harness the power of Big Data without the typical cost.

Key Features: 

  • Three open-source versions available
  • Completely MySQL-accessible
  • Familiar, MySQL interface for large-scale, ad hoc BI
  • Dimensional and predictive analytics
  • Integrates with the Hadoop ™ Distributed File System (HDFS)
  • Real-time, ad hoc analytics within an Apache Hadoop cluster

Cost: FREE

34. GridGain

@GridGain

GridGain reimagines in-memory computing for a competitive edge in the modern business environment. Nikita Ivanov and Dmitriy Setrakyan share a passion for high-performance computing, a vision on which they based the first release of GridGain in 2007. The list of features, functionality and capabilities of these solutions is astounding.

Key Features:

  • In-Memory Data Grid
  • Supports SQL, K/V, MongoDB, MPP, MapReduce
  • Hyper Clustering
  • Zero Deployment
  • Advanced Security
  • Fault Tolerance
  • Load Balancing
  • Customizable Event Workflow
  • Programmatic Querying
  • Minimal or no integration
  • No ETL required
  • Eliminate MapReduce overhead
  • Works with any Hadoop distribution

Cost: Contact for a quote

35. DeepDive

@HazyResearch

A new type of system to help developers analyze data on a deeper level, DeepDive is an open-source project with a simple four-step process for writing applications on the platform. With calibrated probabilities for every assertion it makes, DeepDive is designed to navigate around the problematic nature of human error in development.

Key Features: 

  • Handles large amounts of data from multiple sources
  • Write simple rules and offer feedback on prediction accuracy
  • “Distantly” learns, rather than requiring a tedious machine-learning process for training predictions
  • Scalable, high-performance inference and learning engine

Cost: FREE

36. Lavastorm Analytics

@Lavastorm_News

Businesses turn to analytics to develop faster, better, and cheaper methods and processes. Interestingly, that’s precisely what Lavastorm offers in its platform: faster, better, and cheaper analytics for achieving business goals. Enterprises are demanding more capabilities and faster speeds, and Lavastorm eliminates the need for a disjointed approach with visualization tools, spreadsheets, BI applications, databases and other information silos with a seamless solution delivering end-to-end analytics.

Key Features: 

  • Reduce analytic development by 90% or more
  • Large volumes of data in short amounts of time
  • Reuse and share analytics knowledge across teams
  • Detect hard-to-find issues with 40% fewer false positives
  • Visibility control for management and executives

Cost: Contact for a quote

37. SpagoWorld

@SpagoWorld

From business intelligence to middleware, SpagoWorld offers a range of solutions for enterprises — all on an open-source platform. SpagoWorld’s Big Data BI solution enables the collection of massive quantities of data, in rapid timeframes, for use across SpagoWorld’s other platforms for further analysis and business intelligence derivatives.

Key Features: 

  • Extract data from various platforms, from database and analytics platforms to NoSQL databases or enriched distributions
  • Supports real-time analysis of streaming data
  • Charts, reports, thematic maps, cockpits
  • Translate information to self-service BI
  • Reporting, multi-dimensional analysis
  • Ad hoc reporting
  • Location intelligence
  • Real-time dashboards and console

Cost: FREE

38. RapidMiner

@RapidMiner

A code-free zone, RapidMiner provides advanced analytics with no programming required for configuration. With all-new application wizards for churn reduction, sentiment analysis, predictive maintenance and direct marketing in RapidMiner 6.0, this tool is one of the fastest advanced analytics solutions available.

Key Features: 

  • Hundreds of methods for data integration
  • Runs on every major platform
  • No programming required
  • Drag-and-drop interface

Cost: Contact for a quote

39. Orange

Orange is an open-source data visualization and analysis tool for both novices and experts. Data mining is conducted either through visual programming or Python scripting, with components for machine learning and add-ons for bioinformatics and text mining.

Key Features: 

  • Remembers choices and makes suggestions
  • Intelligently chooses communication channels between widgets
  • Packed with visualization options from bar charts to dendrograms
  • Integration and data analytics
  • Combine widgets to design the framework of your choice
  • Toolbox with more than 100 widgets

Cost: FREE

40. OpenDataSoft

@opendatasoft

OpenDataSoft is a comprehensive discovery tool with maps, charts, and graphs to explore public data sets. A cloud-based platform, OpenDataSoft is designed for seamless and unlimited data publishing, sharing, and reuse.

Key Features: 

  • Reuse data through APIs and apps models
  • Collect data from any source
  • Read and understand all formats
  • Make databases findable and reusable
  • Standard access formats
  • Interactive & shareable visualization
  • API factory
  • Web extensions and open source

Cost (prices in euros): 

  • FREE – Civic initiatives and academic projects
  • €200/month – 100K records, 20K UI/API queries/day
  • €700/month – 10M records, 100K UI/API queries/day
  • Contact for a quote – Unlimited records and UI/API queries/day

 

41. Angoss

@Angoss

A comprehensive marketing analytics solution, Angoss offers real-time Big Data insights for a variety of verticals and business sectors. From credit scoring to opportunity and lead scoring, fraud deterrence and claims management, Angoss is capable of capturing and analyzing data for a multitude of applications.

Key Features:

  • Automated workflows to develop scorecards
  • Select the most predictive variables
  • Advanced predictive modeling
  • Angoss Decision and Strategy Trees
  • Data preparation and profiling
  • Model validation and deployment

Cost: Contact for a quote

42. Mu Sigma

@MuSigmaInc

Mu Sigma is one of the world’s largest Decision Sciences and analytics firms, helping companies to institutionalize data-driven decision making by harnessing Big Data. With a set of proprietary platforms to enable rapid decision-making and comprehensive data collection and integration that eliminates information silos, Mu Sigma is a powerful tool for machine learning, operational research, artificial intelligence, and more.

Key Features: 

  • Hosts Mu Sigma problem DNAs
  • Real-time analytics and event stream processing
  • Load models into an enterprise ecosystem for consumption
  • Embedded advanced analytics engine
  • Influence analysis and topic modeling
  • Sentiment evaluation
  • Easily scaled on commodity hardware

Cost: Contact for a quote

43. ERwin

@ERwinModeling



A collaborative data-monitoring environment, ERwin offers an intuitive, graphical interface with a centralized view of key definitions, so that data can be leveraged as a strategic business asset. The product comes in a number of editions designed for different stakeholders within an organization, each providing a targeted level of information availability, display and configuration for better understanding and usability.

Key Features: 

  • Achieve business agility through model-driven collaboration
  • Collaborate via web or desktop
  • Active model templates and naming standards
  • Display themes, custom data types, macro language and API
  • Custom reporting
  • Metadata integration tools

Cost: 

  • CA ERwin Data Modeler Standard Edition r9.5 – Product plus 1 Year Enterprise Maintenance – $4,794
  • CA ERwin Data Modeler Standard Edition r9.5 – Product plus 3 Years Enterprise Maintenance – $6,392
  • CA ERwin Data Modeler Workgroup Edition r9.5 – Product plus 1 Year Enterprise Maintenance – $6,708
  • CA ERwin Data Modeler Workgroup Edition r9.5 – Product plus 3 Years Enterprise Maintenance – $8,944
  • CA ERwin r9.5 Data Modeler for Microsoft SQL Azure – Product plus 1 Year Enterprise Maintenance – $1,679.94
  • CA ERwin r9.5 Data Modeler for Microsoft SQL Azure – Product plus 3 Years Enterprise Maintenance – $2,239.92
  • CA ERwin r9.5 Web Portal Standard Edition 1-5 Users – Product plus 1 Year Enterprise Maintenance – $8,399.70
  • CA ERwin r9.5 Web Portal Standard Edition 1-5 Users – Product plus 3 Years Enterprise Maintenance – $11,199.60

44. HPCC Systems

@hpccsystems

A proven and battle-tested platform for manipulating, querying, transforming, and data warehousing Big Data, HPCC Systems solves Big Data problems facing modern enterprises in any vertical.

Key Features: 

  • Processing clusters use off-the-shelf hardware
  • Clusters typically homogeneous, but not required
  • Distributed, Thor, Roxie file systems
  • Linux operating system
  • Build multi-key, multi-field (aka compound) indexes on DFS files
  • Data warehouse capabilities for structured queries and data analysis applications
  • Supports thousands of users with sub-second response time, depending on application

Cost: FREE

45. pmOne

@pmOneAG

Offering Big Data and Business Intelligence solutions, pmOne’s cMORE enables users to quickly build, flexibly grow and efficiently administer solutions. It leverages and extends SQL Server functionality, as well as that of Excel, SharePoint, and other components in the Microsoft BI stack.

Key Features: 

  • Simplified standard and ad hoc reporting
  • Credible alternative to SAP-based data warehouse
  • Consistent reporting company-wide
  • Personalize reports; distribute books
  • Easy access to SAP data and other systems
  • Based on Microsoft BI

Predictive analytics and other categories of advanced analytics are becoming a major factor in the analytics market. We evaluate the leading providers of advanced analytics platforms that are used to build solutions from scratch.

Advanced Analytics Platforms


Source: http://www.gartner.com/technology/reprints.do?id=1-1QXWE6S&ct=140219&st=sb.


Announcing the new Professional Master’s program in Big Data

The School of Computing Science at Simon Fraser University is offering a NEW Professional Masters program in Big Data. This four-semester, hands-on program will prepare you for an exciting and well-paid career as a data scientist.

This program is intended for students with some previous programming experience who wish to learn about the state-of-the-art in big data analysis.

Applications are NOW OPEN

The application deadline is April 1, 2014 for the Fall 2014 cohort. We will keep admissions open until we fill the available slots for that cohort. Our goal is to notify students of acceptance to the program by May 1, 2014.

REGISTER FOR THE BIG DATA PROGRAM


This interactive illustration, created by Santiago Ortiz, shows how Twitter’s employees interact among themselves and vividly highlights the pivotal employees in Twitter’s twittersphere. To view the interactive visualization, click here.



Machine Learning Tools

 

BigML

BigML is a cloud-based machine learning platform that allows users to create visual predictive models using raw data and structured datasets. Last month, BigML announced the availability of the 2014 winter release, which includes features that boost predictive modeling. The company also introduced a new paradigm called Programmatic Machine Learning that is the “ability to programmatically transform a dataset via a high-level language and a cloud-based API together.”

The BigML API makes it possible for developers to build applications that incorporate predictive models and near real-time predictions.

Datumbox

Datumbox is a machine learning platform that focuses on natural language processing (NLP). The Datumbox platform features a variety of functions including sentiment analysis, Twitter sentiment analysis, language detection, educational detection and keyword extraction.

The Datumbox API provides programmatic access to the platform’s natural language processing and text analysis functions. ProgrammableWeb recently published an article featuring the Datumbox API.

Diffbot

Diffbot uses computer vision, machine learning and other technologies to extract text, images, links, HTML attributes and other elements from Web pages. In August 2013, the company released the Diffbot Product API, which can extract product information from the pages of e-commerce websites. Earlier this month, ProgrammableWeb reported on the release of 35+ new Diffbot client libraries in a variety of programming languages.

The company provides a suite of Diffbot APIs for extracting data from Web page news articles, Web site home pages, e-commerce product pages and other types of Web pages. There are also APIs for extracting Web page images and automatically classifying Web page links.

Ersatz Labs

Ersatz Labs is a startup and developer of a new platform called Ersatz, described by the company as “the first cloud-based neural network platform.” The Ersatz platform allows developers to build applications that utilize deep neural networks without the need to have extensive knowledge in machine learning.

There is an API that can be accessed via HTTP, and a client library in Python is also available, so Ersatz can be easily integrated with Web, mobile and desktop applications. Ersatz is currently in private beta, and developers interested in participating can request an invitation on the official company Web site.

Google Prediction API

The Google Prediction API provides developers access to Google’s cloud-based machine learning platform and pattern-matching functions. The API is used in conjunction with the Google Cloud Storage API and allows developers to incorporate functions into their apps such as sentiment analysis, spam detection, message routing decisions, suspicious activity identification and more.

IBM Watson

IBM Watson is a machine learning platform that focuses on NLP, hypothesis generation and evidence-based learning. In November 2013, ProgrammableWeb reported that IBM had launched the Watson Developer Cloud, a cloud-based marketplace that provides access to APIs, documentation, self-service training materials and other tools for developers to build IBM Watson-powered applications.

Last month, IBM announced that the company will invest more than $1 billion in the new Watson Group, which will be based in New York City’s “Silicon Alley.” The new group will focus on developing and promoting the IBM Watson platform and cognitive technologies. IBM also announced new Watson cognitive intelligence-based services, including IBM Watson Discovery Advisor, IBM Watson Analytics and IBM Watson Explorer.

Logical Glue

Logical Glue is a machine learning as-a-service (MLaaS) platform that features predictive model building, predictive model real-time deployment, and real-time predictive analytics. The platform is designed to predict customer behavior for many types of markets, particularly financial lending, insurance and marketing.

The Logical Glue platform is currently in private beta; however, companies can apply to participate in the beta program, which allows them access to the platform prerelease. The next release of the platform will include the Logical Glue prediction API.

Parse.ly

Parse.ly is a predictive content optimization and analytics platform designed for blogs, news sites and other online publishers. The home page of the Parse.ly website describes the company as “The Content Performance Authority” and the platform provides users a real-time view of article traffic based on individual posts, authors, sections and referrers. The Parse.ly platform also provides views of content metrics, social network shares, site activity and other analytics.

The Parse.ly API allows developers to programmatically access platform features such as analytics, shares, referrers, real-time, search and recommendations. There are also mobile SDKs available that can be integrated into third-party apps so reader activity can be tracked.

PredictionIO

PredictionIO is a machine learning server that allows developers to add predictive features to software, web and mobile applications. PredictionIO is open source and can be installed on a stand-alone server. There is also a cloud version available on Amazon EC2/Amazon EBS.

The PredictionIO API enables applications to collect and manage app data and add predictive features such as predicting user preferences, personalized content, content discovery, content recommendations and more. ProgrammableWeb recently published an interview with Simon Chan (cofounder and CEO of PredictionIO), which covers PredictionIO features, comparisons with other machine learning APIs and more.

 

SwiftKey

SwiftKey is a developer of touchscreen keyboard applications and word prediction technology. SwiftKey’s products Keyboard, Flow and Note incorporate machine learning and SwiftKey’s language technology, available to developers via API and SDK.

A recent TechCrunch article featured SwiftKey’s word prediction technology. Nathan Matias, a PhD student at the MIT Media Lab, used SwiftKey technology to create a sonnet essentially co-authored by Shakespeare and generated entirely from the SwiftKey next word suggestions.

Read more: http://blog.programmableweb.com/2014/02/21/machine-learning-and-predictive-analytics-key-to-business-growth-and-profitability/#ixzz2u5Vip200 


GraphLab Create is a Python package that enables developers and data scientists to apply machine learning to build state of the art data products. Get started fast with our fully customizable GraphLab DataApps. GraphLab Create is fast, scalable and makes it easy to deploy your apps to the Cloud.

Python

http://graphlab.com/


Configuring an Oozie job with an HDInsight Hadoop cluster.

SQL on Hadoop

February 18, 2014

Source: http://www.it-director.com/content.php?cid=14668

The good thing about running SQL on Hadoop is that SQL is a declarative language, which means that you don’t need to know where the data is, you just have to ask for it and then the database works out how to get the information you need. However, unless you have a database optimiser the performance will suck.
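To make the declarative point concrete, here is a minimal sketch in Python. It uses SQLite (through the standard-library sqlite3 module) purely as a stand-in for a SQL-on-Hadoop engine, and the customers and orders tables are invented for illustration. The query only says what is wanted; the engine's planner decides how to fetch it, which you can inspect with EXPLAIN QUERY PLAN.

```python
import sqlite3


def run_declarative_query():
    """Run one declarative query; return (result rows, planner steps)."""
    # In-memory SQLite database: an assumption standing in for a real
    # SQL-on-Hadoop engine. Table names and data are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            amount REAL
        );
        CREATE INDEX idx_orders_customer ON orders (customer_id);
        INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
        INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
        """
    )
    # The query states WHAT we want, not where the rows live or in which
    # order the tables should be scanned.
    query = """
        SELECT c.region, SUM(o.amount)
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.region
        ORDER BY c.region
    """
    rows = conn.execute(query).fetchall()
    # The last column of each EXPLAIN QUERY PLAN row is a human-readable
    # description of one step chosen by the planner.
    plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
    conn.close()
    return rows, plan


if __name__ == "__main__":
    result_rows, plan_steps = run_declarative_query()
    print(result_rows)
    for step in plan_steps:
        print("plan:", step)
```

The plan text varies between SQLite versions, so treat it as something to read rather than something to depend on.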

Now there are various SQL initiatives around but probably the most advanced is Impala. And in version 1.2, which was introduced at the end of December, Cloudera introduced facilities to optimise join order but, while this is a step in the right direction, it hardly constitutes a full-blown optimiser.
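Why join order matters can be shown with a toy cost model. The sketch below is not how Impala or any real optimiser works; the table sizes, selectivities and function names are all hypothetical. It simply estimates how many intermediate rows each left-deep join order produces and picks the cheapest one.

```python
from itertools import permutations


def intermediate_rows(order, sizes, selectivity):
    """Sum the estimated intermediate result sizes of a left-deep join order."""
    joined = [order[0]]
    rows = sizes[order[0]]
    total = 0
    for table in order[1:]:
        # Multiply in the selectivity of every predicate that links the new
        # table to one already joined. A missing predicate counts as 1.0,
        # which models a cross product.
        factor = 1.0
        for prev in joined:
            factor *= selectivity.get(frozenset((prev, table)), 1.0)
        rows = rows * sizes[table] * factor
        total += rows
        joined.append(table)
    return total


def best_order(sizes, selectivity):
    """Brute-force search over all join orders (fine for a handful of tables)."""
    return min(
        permutations(sizes),
        key=lambda order: intermediate_rows(order, sizes, selectivity),
    )


# Hypothetical statistics: one large fact table and two small dimensions.
SIZES = {"sales": 1_000_000, "stores": 100, "regions": 10}
SELECTIVITY = {
    frozenset(("sales", "stores")): 0.01,
    frozenset(("stores", "regions")): 0.1,
}

if __name__ == "__main__":
    for order in permutations(SIZES):
        print(order, intermediate_rows(order, SIZES, SELECTIVITY))
    print("cheapest:", best_order(SIZES, SELECTIVITY))
```

Even in this tiny example, joining the two small tables first keeps the intermediate results far smaller than starting with a cross product against the large table. A full optimiser weighs much more than this (access paths, data distribution, statistics quality), which is part of why building one takes years.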

However, a couple of related announcements have caught my eye this week. The first was that Calpont has changed its name to that of its product, InfiniDB; it has also raised another round of funding and announced version 4.5 of its database with an Enterprise Management dashboard. None of this has much to do with Hadoop, except that it reminded me that Calpont (as it then was) announced the availability of InfiniDB running on Hadoop last year, along with an open source license. And, of course, InfiniDB has a grown-up optimiser.

Another product that has an adult optimiser is HP Vertica. And MapR has just announced an early access program (prior to general availability in March) for the HP Vertica Analytics Platform running on the MapR Hadoop distribution.

The truth is that you will get much better performance (orders of magnitude better) from either InfiniDB or Vertica than you will from Impala. So this poses three questions: firstly, will we see more vendors porting their warehouse products onto Hadoop (or HDFS); secondly, how quickly will Cloudera or HortonWorks (with its SQL implementation) be able to produce an optimiser that can compete reasonably well with these intruders into their market; and, thirdly, how much does this matter?

The answer to the first question is yes. I don’t know who or when, but this is the general trend, not just in data warehousing but across a variety of markets. The answer to the second question is not soon: it takes years to develop a good optimiser. It probably takes fewer years than it used to, because there is plenty of experience out there, which was not the case historically, but it is still a significant period.

Thirdly, yes it matters. You may have to pay a license fee for HP Vertica (or not, in the case of InfiniDB) but the performance advantages you get from having a decent optimiser will mean that you need significantly less hardware in order to get comparable performance, and that should more than offset any such license fees. And that also explains why I expect more vendors to do the same thing as InfiniDB and Vertica, because there is a window of opportunity while Cloudera gets its optimiser up to speed.



Long marketed as a way to master huge quantities of data, Hadoop is now booming because its proponents have learned to sell it small.
More: http://readwrite.com/2014/02/17/hadoop-adoption-big-data-small-bytes#feed=/enterprise&awesm=~owgJzF09F3A7c7

Typical Stages and Milestones of Big Data Adoption



UD professor presented award for natural language processing work

The Association for Mathematics of Language (SIGMOL) honored the University of Delaware’s Vijay K. Shanker with the inaugural S.Y. Kuroda Prize on Jan. 7 for his work in natural language processing.

The award is given for long-lasting advancements in mathematical linguistics. Shanker, professor in the Department of Computer and Information Sciences, was selected for his work on the convergence of mildly context-sensitive (MCS) formalisms.


Skills needed to become a Data Scientist


The-Explosion-of-the-Internet-of-Things



#BigData skills pay top dollar

Nine of the 10 highest-paying IT jobs are for skills related to Big Data:

1. R: $115,531
2. NoSQL: $114,796
3. MapReduce: $114,396
4. PMBok: $112,382
5. Cassandra: $112,382
6. Omnigraffle: $111,039
7. Pig: $109,561
8. Service Oriented Architecture: $108,997
9. Hadoop: $108,669
10. Mongo DB: $107,825

http://www.computerworld.com/s/article/9246120/Big_Data_skills_pay_top_dollar


Environment : Windows 7 – 64bit and Virtual Box

VM Image : CDH4 Packages for Virtual Box


1. Create a new Virtual Machine

Create new Cloudera Hadoop virtual machine

2. Enter a name for the new virtual machine and select the type of guest operating system you plan to install into it


 

3. Select the amount of base memory (RAM)

Setting up Cloudera Hadoop in Windows

4. Select “Use existing hard disk” and navigate to the folder where you downloaded Cloudera-demo-vm.
If you don’t have the demo VM, download it from here: CDH4 Packages for Virtual Box

Running Cloudera Hadoop in Windows 7

5. You’re going to create a new Cloudera Hadoop virtual machine in the Windows 7 operating system


6. Now turn on Cloudera Hadoop in Windows and run the demo


 

7. Starting Cloudera Hadoop in Windows virtual machine


8. Cloudera Hadoop demo is now ready in Windows


Running the VM

Once you launch the VM, you are automatically logged in as the cloudera user.
The account details are:

  • username: cloudera
  • password: cloudera

The cloudera account has sudo privileges in the VM.

To learn more about Hadoop, see the Hadoop Tutorial.

You can access status through the browser at the following URLs:

  • NameNode status (localhost:50070)
  • JobTracker status (localhost:50030)
  • The Hue user interface (localhost:8888)
  • The HBase web UI (localhost:60010)
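If you would rather check these pages from a script than click through them, something like the following rough sketch should do. It assumes the ports listed above, and that you run it inside the VM (or have forwarded those ports to the host). The SERVICES table and the function names are my own, not part of the Cloudera demo.

```python
from http.client import HTTPException
from urllib.error import URLError
from urllib.request import urlopen

# Ports taken from the list above; the dictionary itself is just a
# convenience for this sketch.
SERVICES = {
    "NameNode": 50070,
    "JobTracker": 50030,
    "Hue": 8888,
    "HBase": 60010,
}


def status_url(port, host="localhost"):
    """Build the URL of a status page on the given port."""
    return f"http://{host}:{port}/"


def is_reachable(url, timeout=3.0):
    """Return True if the page answers with a 2xx or 3xx status."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (URLError, OSError, HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed
        # reply: treat all of these as "not reachable".
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        url = status_url(port)
        state = "up" if is_reachable(url) else "not reachable"
        print(f"{name:<11} {url:<26} {state}")
```

Services can take a minute or two to come up after the VM boots, so a "not reachable" right after startup is not necessarily a problem.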

 


Enjoy your Cloudera Hadoop demo in Windows