Wednesday, March 30, 2016

Research about Graph Database


Knowledge Sharing:
1.  Authorization and access control solutions store information about parties (e.g., administrators) and resources (e.g., employees), together with the rules governing access to those resources. The control system then applies these rules to determine who can access or manipulate a resource. Access control has traditionally been implemented either using directory services or by building a custom solution inside an application’s backend. Unfortunately, hierarchical directory structures built on a relational database suffer from join pain as the dataset grows: queries become slow and unresponsive, ultimately delivering a poor end-user experience.
2. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
3. In graph theory and network analysis, indicators of centrality identify the most important vertices within a graph. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, and super-spreaders of disease.
4. Social network analysis (SNA) is the process of investigating social structures through the use of network and graph theories.[1] It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties or edges (relationships or interactions) that connect them.
5. Betweenness is a centrality measure of a vertex within a graph (there is also edge betweenness, which is not discussed here). Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes.
6. The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications[1] originally designed as a metadata data model.
7.  The RDF data model[2] is similar to classical conceptual modeling approaches such as entity–relationship or class diagrams, as it is based upon the idea of making statements about resources (in particular web resources) in the form of subject–predicate–object expressions.
8.  An entity–relationship model (ER model) is a data model for describing the data or information aspects of a business domain or its process requirements.
9.  The Resource Description Framework (RDF) is a data model rather than a query language; RDF data is typically queried with SPARQL. In a graph-based data model (RDF graph), the structure is a set of triples, each consisting of a subject, a predicate, and an object.
10.  A transportation company has hundreds of locomotive assets, and each locomotive might have thousands of parts.
11.  Titan is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time.
12.  Scaling graph data processing for real time traversals and analytical queries is Titan’s foundational benefit.
14.  Titan is a graph database engine. Titan itself is focused on compact graph serialization, rich graph data modeling, and efficient query execution. In addition, Titan utilizes Hadoop for graph analytics and batch graph processing. Titan implements robust, modular interfaces for data persistence, data indexing, and client access. Titan’s modular architecture allows it to interoperate with a wide range of storage, index, and client technologies; it also eases the process of extending Titan to support new ones.
15.  Embed Titan inside the application executing Gremlin queries directly against the graph within the same JVM. Query execution, Titan’s caches, and transaction handling all happen in the same JVM as the application while data retrieval from the storage backend may be local or remote.
16.  Interact with a local or remote Titan instance by submitting Gremlin queries to the server. Titan natively supports the Gremlin Server component of the Tinkerpop stack.
17.  Gremlin is Titan’s query language used to retrieve data from and modify data in the graph. Gremlin is a path-oriented language which succinctly expresses complex graph traversals and mutation operations. Gremlin is a functional language whereby traversal operators are chained together to form path-like expressions. For example: "from Hercules, traverse to his father and then his father’s father and return the grandfather’s name."
18.  Gremlin is developed independently from Titan and supported by most graph databases. By building applications on top of Titan through the Gremlin query language, users avoid vendor lock-in because their application can be migrated to other graph databases supporting Gremlin.
19.  Titan supports two different kinds of indexing to speed up query processing: graph indexes and vertex-centric indexes. Most graph queries start the traversal from a list of vertices or edges that are identified by their properties. Graph indexes make these global retrieval operations efficient on large graphs. Vertex-centric indexes speed up the actual traversal through the graph, in particular when traversing through vertices with many incident edges.
20.  Titan stores graphs in adjacency list format which means that a graph is stored as a collection of vertices with their adjacency list. The adjacency list of a vertex contains all of the vertex’s incident edges (and properties).
21.  Graph databases are based on graph theory. Graph databases employ nodes, properties, and edges.
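
The adjacency-list storage in item 20 and the Hercules traversal in item 17 can be illustrated with a small in-memory sketch. This is plain Python, not Titan's storage format or the Gremlin API. The class name `TinyGraph`, the `father` edge label, and the sample vertices are assumptions made up for illustration. Calling `out` twice loosely mimics how Gremlin steps are chained.

```python
from collections import defaultdict


class TinyGraph:
    """Toy property graph stored as per-vertex adjacency lists (illustrative only)."""

    def __init__(self):
        self.properties = {}                # vertex id -> property dict
        self.out_edges = defaultdict(list)  # vertex id -> [(edge label, target id)]

    def add_vertex(self, vertex_id, **props):
        self.properties[vertex_id] = props

    def add_edge(self, source, label, target):
        self.out_edges[source].append((label, target))

    def out(self, vertex_ids, label):
        """Follow outgoing edges with the given label from every vertex in vertex_ids."""
        return [
            target
            for vertex_id in vertex_ids
            for edge_label, target in self.out_edges[vertex_id]
            if edge_label == label
        ]


# Hypothetical sample data for the Hercules example.
graph = TinyGraph()
graph.add_vertex("hercules", name="Hercules")
graph.add_vertex("jupiter", name="Jupiter")
graph.add_vertex("saturn", name="Saturn")
graph.add_edge("hercules", "father", "jupiter")
graph.add_edge("jupiter", "father", "saturn")

# From Hercules, step to his father, then to his father's father, and read the name.
fathers = graph.out(["hercules"], "father")
grandfathers = graph.out(fathers, "father")
grandfather_names = [graph.properties[v]["name"] for v in grandfathers]
print(grandfather_names)
```

Because each vertex keeps its own edge list, a traversal step only touches the edges of the vertices it is currently visiting, which is the property the adjacency-list layout is meant to provide.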


Learning:
1.  With a graph database, we can ask intelligent questions of the data and get results directly from a query, instead of relying on smart up-front data modeling with a fixed schema and attributes.
2.  Algorithm First.
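
As one example of starting from the algorithm, the PageRank idea noted under Knowledge Sharing (rank flows along links, so pages linked from important pages become important) can be sketched with a simple power iteration. This is a simplified version under stated assumptions: the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page sample graph are all illustrative choices, not taken from the notes above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of linked pages."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to "c", and "c" links back to "a".
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
most_important = max(ranks, key=ranks.get)
print(most_important)
```

Page "a" ends up ranked well above "b" and "d" even though it has a single incoming link, because that link comes from the highly ranked page "c". This is the "quality of links" part of the description.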

Tools:
Neo4j
Titan (Tinkerpop)

Theory:
1.  CAP theorem: Consistency, Availability, Partition tolerance

Theory:
Property Graph Model
RDF Graph
DAG
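
To make the RDF graph model concrete, here is a minimal sketch that stores subject-predicate-object triples in a Python set and answers pattern queries with wildcards. It does not use a real RDF library or SPARQL. The `match` helper, the `hasPart` and `type` predicates, and the locomotive identifiers are hypothetical names chosen to echo the locomotive-parts example in the notes above.

```python
# Each statement is a (subject, predicate, object) triple.
triples = {
    ("locomotive-1", "type", "Locomotive"),
    ("locomotive-1", "hasPart", "engine-7"),
    ("engine-7", "hasPart", "piston-3"),
    ("locomotive-2", "type", "Locomotive"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Which parts does locomotive-1 have directly?
direct_parts = match(triples, subject="locomotive-1", predicate="hasPart")
print(direct_parts)

# Which subjects are locomotives?
locomotives = [s for s, _, _ in match(triples, predicate="type", obj="Locomotive")]
print(locomotives)
```

In a property graph model, the same information would typically be held as vertices with properties and labeled edges rather than as a flat set of statements.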

Terms:
Gremlin
RDF triples
GEL

References:
http://www.slideshare.net/verheughe/graph-connect-sf-4-oct-2013
https://www.dama.upc.edu/seminars/2nd-graph-ta/7GraphTAFebruary2014BarcelonaRoarAudunMyrsetandSebastianVerheugheUsingagraphdatabaseforresourceauthorization.pdf
http://neo4j.com/blog/access-control-lists-the-graph-database-way/
http://msdn.microsoft.com/en-us/library/azure/hh974476.aspx
http://neo4j.com/developer/guide-data-modeling/
http://www.slideshare.net/maxdemarzi/introduction-to-graph-databases-12735789
http://www.predictiveanalyticstoday.com/top-graph-databases/
https://en.wikipedia.org/wiki/Graph_database
http://tinkerpop.incubator.apache.org/docs/3.0.1-incubating/#intro
https://en.wikipedia.org/wiki/Bigtable
http://www.slideshare.net/knowfrominfo/titan-big-graph-data-with-cassandra?next_slideshow=1