Network Analysis and Visualization

Network analysis is a methodological framework that treats social structures as systems of interconnected elements. In the context of social media research, the elements are typically users, posts, hashtags, or other digital artefacts, and the connections represent interactions such as follows, mentions, retweets, likes, or co‑occurrences. Understanding the vocabulary that underpins network analysis and visualization is essential for constructing, analysing, and interpreting social media data in a rigorous manner.

Node – also called a vertex – is the fundamental unit of a network. In a Twitter network a node might be an individual account, while in a hashtag co‑occurrence network a node represents a specific tag. Nodes can carry attributes (e.g., user location, account age, follower count) that are later mapped to visual properties such as colour or size.

Edge – the link that joins two nodes – represents a relationship or interaction. Edges may be directed, indicating a one‑way flow (e.g., a follow from user A to user B), or undirected, indicating a reciprocal or symmetric relationship (e.g., a mutual friendship on Facebook). An edge can also be weighted, where the weight quantifies the strength or frequency of the interaction, such as the number of retweets between two accounts.

Graph – the mathematical representation of a set of nodes and edges – is the formal structure used for analysis. A graph can be denoted G = (V, E) where V is the set of vertices (nodes) and E is the set of edges. In social media research, graphs are often constructed from API data dumps, scraped web pages, or platform‑specific export files.

Directed graph – a graph in which every edge has an orientation – is essential when the direction of information flow matters. For example, in a citation network the direction indicates which paper references which other paper. In a Twitter mention network, a directed edge from user A to user B shows that A has mentioned B.

Undirected graph – a graph where edges have no orientation – is appropriate when the relationship is inherently reciprocal. A Facebook friendship network is typically modelled as undirected because the platform enforces mutual acceptance.

Weighted graph – a graph where each edge carries a numeric value – allows researchers to differentiate between strong and weak ties. In a retweet network, the weight could be the total number of retweets between two users over a defined period. Weighted graphs enable more nuanced centrality calculations and community detection.

Adjacency matrix – a square matrix that records the presence or absence (and sometimes weight) of edges between each pair of nodes – is a core data structure. The entry a_ij = 1 (or a positive weight) indicates an edge from node i to node j; a_ij = 0 indicates no edge. For undirected graphs the matrix is symmetric, while for directed graphs it is generally asymmetric. Large social media networks often produce sparse adjacency matrices, where most entries are zero.
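As a minimal sketch of this structure, the following plain‑Python snippet builds a dense adjacency matrix from a weighted edge list. The account names and retweet counts are hypothetical, and a dense list of lists is only practical for small graphs; sparse formats are the usual choice for real social media data.

```python
# Hypothetical toy retweet data: (source, target, number_of_retweets).
edges = [("alice", "bob", 3), ("bob", "carol", 1), ("alice", "carol", 2)]

def adjacency_matrix(edges, directed=True):
    """Return (sorted node list, dense weight matrix) for a weighted edge list."""
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target, weight in edges:
        i, j = index[source], index[target]
        matrix[i][j] += weight
        if not directed and i != j:
            matrix[j][i] += weight  # mirror the entry so the matrix is symmetric
    return nodes, matrix

nodes, A = adjacency_matrix(edges)                    # directed: asymmetric
_, A_undirected = adjacency_matrix(edges, directed=False)  # symmetric
```

Row i, column j of `A` holds the weight of the edge from `nodes[i]` to `nodes[j]`; a zero means no edge.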

Incidence matrix – a rectangular matrix that records the relationship between nodes and edges – is another representation. Rows correspond to nodes, columns to edges, and the entries indicate whether a node participates in a given edge. Incidence matrices are useful for certain algebraic calculations, such as determining the rank of a network.

Degree – the number of edges attached to a node – is a basic measure of activity. In a directed graph, each node has an in‑degree (number of incoming edges) and an out‑degree (number of outgoing edges). High‑degree nodes often act as hubs, attracting attention or disseminating information widely.

Degree centrality – the normalized form of degree that scales the raw count by the maximum possible degree – provides a comparable metric across networks of different sizes. Nodes with the highest degree centrality are typically influencers or broadcasters in social media contexts.
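The degree measures above can be sketched in a few lines of plain Python. The edge list is hypothetical, and the normalisation assumed here divides total degree by n - 1 (the number of other nodes), which is one common convention; in a directed graph the combined in‑ plus out‑degree can therefore exceed 1.

```python
from collections import Counter

# Hypothetical directed edges: source -> target (e.g., "a retweeted c").
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]

def degree_summary(edges):
    """Return in-degree, out-degree and normalised degree for every node."""
    out_deg = Counter(s for s, _ in edges)
    in_deg = Counter(t for _, t in edges)
    nodes = sorted(set(out_deg) | set(in_deg))
    n = len(nodes)
    summary = {}
    for v in nodes:
        total = in_deg[v] + out_deg[v]
        summary[v] = {
            "in": in_deg[v],
            "out": out_deg[v],
            # Assumed normalisation: total degree over the n - 1 other nodes.
            "centrality": total / (n - 1) if n > 1 else 0.0,
        }
    return summary

summary = degree_summary(edges)
```

In this toy graph node `c` receives every incoming edge it could, so it has the highest in‑degree and the highest normalised score.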

Betweenness centrality – a measure that captures how often a node lies on the shortest paths between other node pairs – identifies brokers or gatekeepers. A user with high betweenness may control the flow of information between otherwise disconnected communities, making them a strategic target for diffusion campaigns.
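To make the definition concrete, here is a compact sketch of betweenness for an unweighted, undirected graph, following the breadth‑first accumulation scheme usually attributed to Brandes. The adjacency dictionary and node names are hypothetical, and the scores are left unnormalised (raw counts of node pairs).

```python
from collections import deque

def betweenness(adj):
    """Unnormalised betweenness; adj maps each node to its neighbours."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}

# Two hypothetical circles {a, b} and {c, d} that only meet through x.
adj = {"a": ["b", "x"], "b": ["a", "x"], "x": ["a", "b", "c", "d"],
       "c": ["x", "d"], "d": ["x", "c"]}
scores = betweenness(adj)
```

Here `x` is the only route between the two circles, so it carries all four cross‑circle pairs while every other node scores zero.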

Closeness centrality – the inverse of the average shortest path length from a node to all other reachable nodes – indicates how quickly a node can reach the rest of the network. In a fast‑moving platform like Twitter, users with high closeness can spread messages efficiently.
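A minimal sketch of this calculation, assuming an unweighted graph stored as a hypothetical adjacency dictionary: breadth‑first search gives the hop distance to every reachable node, and closeness is the number of reachable nodes divided by the sum of those distances.

```python
from collections import deque

def closeness(adj, source):
    """Inverse of the mean shortest-path length from source to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    reachable = len(dist) - 1          # exclude the source itself
    total = sum(dist.values())
    return reachable / total if total else 0.0

# Hypothetical three-node chain: a - b - c.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
middle = closeness(chain, "b")
end = closeness(chain, "a")
```

The middle node reaches both others in one hop, so it scores higher than either end of the chain.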

Eigenvector centrality – a recursive metric that assigns higher scores to nodes that are connected to other high‑scoring nodes – reflects prestige. PageRank, the algorithm that powers Google’s search rankings, is a variant of eigenvector centrality adapted for directed, weighted graphs. In social media research, eigenvector centrality can highlight accounts that are not only well‑connected but also linked to other influential accounts.
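The recursive idea can be illustrated with a simplified PageRank power iteration. This is a teaching sketch, not any platform's production algorithm: the damping factor of 0.85 is the conventional default, the graph is hypothetical, and every node is assumed to appear as a key in `out_links`.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Simplified PageRank; out_links maps each node to the nodes it points to."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank

# Hypothetical mention graph: a, b and d all mention c; c mentions a.
out_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(out_links)
```

Node `a` receives only one incoming link, yet it ranks well above `b` and `d` because that link comes from the highly ranked `c`.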

Clustering coefficient – the proportion of a node’s neighbours that are also connected to each other – measures local cohesion. A high clustering coefficient suggests a tightly knit community, which may correspond to echo chambers or interest groups on platforms like Reddit.
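As a small illustration, the local clustering coefficient of a node in an undirected graph can be computed by counting the links among its neighbours. The adjacency sets below are hypothetical, and nodes with fewer than two neighbours are assigned 0 by assumption.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Share of neighbour pairs of `node` that are themselves connected."""
    neighbours = set(adj[node]) - {node}
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbours
    links = sum(1 for u, w in combinations(neighbours, 2) if w in adj[u])
    return links / (k * (k - 1) / 2)

# Hypothetical graph: a knows b, c and d; only b and c know each other.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
coeff_a = local_clustering(adj, "a")
coeff_b = local_clustering(adj, "b")
```

Node `a` has three possible neighbour pairs but only one of them is linked, whereas both of `b`'s neighbours know each other.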

Network density – the ratio of existing edges to the total possible edges – provides a global sense of how saturated a network is. Social media networks are typically sparse, with densities far below 1, because the number of potential connections grows quadratically with the number of users.
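The ratio can be written out directly. In this sketch the possible edges are n(n - 1) for a directed graph and n(n - 1)/2 for an undirected one, self‑loops excluded; the node and edge counts in the example are made up for illustration.

```python
def density(n_nodes, n_edges, directed=False):
    """Existing edges divided by the number of possible edges (no self-loops)."""
    if n_nodes < 2:
        return 0.0
    possible = n_nodes * (n_nodes - 1)
    if not directed:
        possible /= 2
    return n_edges / possible

small = density(4, 3)                                  # 3 of 6 possible edges
large = density(1_000_000, 5_000_000, directed=True)   # hypothetical platform sample
```

Even with five million edges, the hypothetical million‑node graph has a density of roughly five in a million, which is why sparse data structures are preferred.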

Path – a sequence of edges that connects a series of nodes – is the backbone of many analytical concepts. The shortest path, also called the geodesic, is the path with the minimum number of edges (or minimum cumulative weight) between two nodes.

Diameter – the longest shortest path in the network – defines the maximum distance between any two nodes. Small diameters (the “small‑world” phenomenon) are common in online platforms, indicating that any two users are separated by only a few intermediaries.
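A brute‑force sketch of the diameter, assuming a small, connected, unweighted graph: run a breadth‑first search from every node and keep the largest distance found. If the graph is disconnected, this version only measures distances within components. The two example graphs are hypothetical.

```python
from collections import deque

def eccentricity(adj, source):
    """Largest hop distance from source to any node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return max(dist.values())

def diameter(adj):
    """Longest shortest path, found by searching from every node."""
    return max(eccentricity(adj, v) for v in adj)

chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
chain_diameter = diameter(chain)
star_diameter = diameter(star)
```

Both graphs have four nodes, but the hub shortens every route in the star, giving it the smaller diameter.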

Assortativity – the tendency of nodes to connect with others that are similar in some attribute (e.g., degree, age, location) – reveals homophily patterns. Positive degree assortativity means high‑degree nodes preferentially link to other high‑degree nodes, a pattern observed in many professional networking sites.

Homophily – the principle that similarity breeds connection – is a driving force behind community formation. In hashtag networks, for instance, users who share political ideology are more likely to co‑use the same tags.

Community detection – the process of partitioning a network into sub‑graphs (communities) that are more densely connected internally than externally – is essential for uncovering latent structures. Algorithms such as modularity optimisation, Infomap, and Leiden are widely used. The resulting communities can be interpreted as interest groups, activist clusters, or brand fan bases.

Modularity – a quality function that quantifies the strength of division of a network into communities – ranges from –1/2 to 1, with higher values indicating clearer community structure. A value near 0 means the partition is no better than would be expected if edges were placed at random. When modularity is low, the network may be more diffuse, suggesting a lack of cohesive sub‑groups.
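For an unweighted, undirected graph, modularity can be computed per community as the share of edges that fall inside it minus the share expected from its total degree. The sketch below assumes a hypothetical edge list and a dictionary assigning each node to a community label.

```python
def modularity(edges, community):
    """Modularity of a partition; community maps node -> community label."""
    m = len(edges)
    internal = {}    # edges with both endpoints in the same community
    degree_sum = {}  # total degree of the nodes in each community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Two hypothetical triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
lumped = {node: 0 for node in split}
q_split = modularity(edges, split)
q_lumped = modularity(edges, lumped)
```

Separating the two triangles yields a clearly positive score, while placing every node in a single community scores zero.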

Structural holes – gaps between otherwise disconnected clusters – provide opportunities for brokerage. An actor that spans a structural hole can access non‑redundant information, gaining a competitive advantage. In Twitter, a user who follows two otherwise separate activist circles can act as a bridge for cross‑movement dialogue.

Brokerage – the act of mediating interactions between otherwise disconnected actors – can be measured using specialised centrality metrics such as the brokerage score. Brokerage analysis helps identify users who facilitate knowledge transfer across organisational boundaries.

Ego network – the sub‑graph centred on a focal node (the ego) and all nodes directly connected to it (the alters) – is useful for micro‑level analysis. Ego networks reveal the immediate social environment of a user, allowing researchers to examine personal influence, support structures, or exposure to diverse viewpoints.

Sociocentric network – a network that attempts to capture all ties within a bounded population – contrasts with ego‑centric approaches. In a platform‑wide study, a sociocentric network might include all interactions among a defined set of accounts (e.g., all verified political journalists on Twitter).

Affiliation network – a bipartite graph that links two distinct types of nodes (e.g., users and hashtags, or actors and events) – is a powerful way to model co‑participation. Projecting a bipartite network onto one mode (e.g., users) yields a co‑affiliation network where edges represent shared affiliations.

Bipartite graph – a graph whose nodes can be divided into two disjoint sets such that edges only run between sets – is the formal term for affiliation networks. Visualising bipartite graphs often requires special layout techniques to keep the two node types distinct.
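The one‑mode projection of a user–hashtag affiliation network can be sketched as follows. The membership pairs are hypothetical, and the edge weight is assumed to be the number of hashtags two users share.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (user, hashtag) pairs forming a bipartite edge list.
memberships = [("u1", "#vote"), ("u2", "#vote"), ("u1", "#climate"),
               ("u2", "#climate"), ("u3", "#climate")]

def project_users(memberships):
    """Project onto users; weight = number of hashtags a pair has in common."""
    users_by_tag = defaultdict(set)
    for user, tag in memberships:
        users_by_tag[tag].add(user)
    weights = defaultdict(int)
    for users in users_by_tag.values():
        for a, b in combinations(sorted(users), 2):
            weights[(a, b)] += 1
    return dict(weights)

co_affiliation = project_users(memberships)
```

Swapping the roles of user and hashtag in the same function would give the hashtag co‑occurrence projection instead.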

Multiplex network – a network that contains multiple layers of ties, each representing a different type of relationship (e.g., follows, likes, mentions) – captures the richness of social media interaction. Multiplex analysis can reveal whether certain users dominate across layers or whether different layers exhibit divergent community structures.

Dynamic network – a network that evolves over time – is crucial for studying diffusion, virality, or the rise and fall of online movements. Temporal slices (snapshots) or event‑based windows (e.g., before and after a major news event) allow researchers to track changes in centrality, community composition, and network density.

Temporal network – a specific type of dynamic network where edges are stamped with precise timestamps – enables fine‑grained analysis of cascade dynamics. By ordering edges chronologically, one can reconstruct the exact pathway of a meme’s spread.
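One simple way to move from timestamped edges to snapshots is to bucket them into fixed‑width windows. The sketch assumes integer timestamps (for example, seconds since the start of collection) and a hypothetical one‑hour window.

```python
from collections import defaultdict

def snapshots(timed_edges, window):
    """Group (source, target, timestamp) edges into fixed-width time windows."""
    slices = defaultdict(list)
    for source, target, ts in timed_edges:
        slices[ts // window].append((source, target))
    return dict(sorted(slices.items()))

# Hypothetical retweet events with timestamps in seconds.
timed_edges = [("a", "b", 10), ("b", "c", 3500), ("c", "d", 3700)]
hourly = snapshots(timed_edges, window=3600)
```

Each key of `hourly` is a window index and each value is the edge list of that snapshot, ready for per‑slice metrics; shrinking `window` gives finer slices at the cost of sparser graphs.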

Network visualization – the graphical representation of nodes and edges – serves both exploratory and communicative purposes. Effective visualisation balances aesthetic clarity with accurate encoding of structural information.

Layout algorithm – the computational method that determines the spatial positioning of nodes – influences readability. Common algorithms include force‑directed (e.g., Fruchterman‑Reingold), circular, hierarchical (Sugiyama), and geographic (GIS‑based) layouts. Force‑directed layouts treat edges as springs and nodes as repelling particles, producing intuitive cluster formations.

Force‑directed layout – a layout that simulates physical forces – is the default in many tools because it tends to reveal community structures without prior knowledge. However, for very large networks the algorithm can become computationally intensive, requiring simplification or sampling.

Node size – a visual attribute that can encode a quantitative metric such as degree, betweenness, or activity level – helps viewers quickly identify important actors. Over‑sizing nodes can obscure underlying topology, so a balanced scaling approach is recommended.

Node colour – another visual channel – is often used to map categorical attributes (e.g., political affiliation, language, device type) or continuous variables (e.g., sentiment score) via a colour gradient. Colour choice should consider colour‑blind accessibility; palettes such as viridis or Tableau’s Color Blind Safe set are advisable.

Edge thickness – the visual weight of a line – typically reflects edge weight. Thicker edges indicate stronger ties, such as a high frequency of interactions between two users. In dense networks, varying edge thickness can aid in distinguishing salient connections.

Edge colour – can encode direction (e.g., a gradient from source to target) or type (e.g., retweet vs. reply). Using subtle colour differences prevents visual clutter while still conveying important distinctions.

Attribute mapping – the process of linking data attributes to visual properties – is central to effective visualisation. It requires careful preprocessing, including normalisation, discretisation, and outlier handling, to avoid misleading representations.

Geographic information system (GIS) integration – the combination of spatial data with network data – enables the mapping of online interactions onto physical locations. For example, plotting the origins of tweets about a local protest can reveal geographic patterns of mobilisation.

Data source – the origin of network data – influences the completeness and bias of the resulting graph. Common sources for social media network analysis include:

Twitter API – provides endpoints for follower lists, retweet histories, mentions, and user timelines. The standard API imposes rate limits; the academic research track expands limits and offers full‑archive search.

Facebook Graph API – supplies data on page likes, comments, and shares. Access is subject to app review and privacy restrictions, especially after recent policy changes.

Instagram Basic Display API – delivers public profile information and media objects, but does not expose follower relationships, limiting network construction.

Reddit API (PRAW) – enables extraction of submission and comment trees, facilitating discussion‑network analysis.

Web scraping – can complement API data when certain interactions are not exposed, though it raises legal and ethical considerations.

Software tools – for constructing, analysing, and visualising networks – range from point‑and‑click applications to programmable libraries:

Gephi – an open‑source desktop application with a rich set of layout algorithms, dynamic network support, and plug‑ins for community detection. Gephi’s Data Laboratory allows rapid attribute editing, while the Preview mode offers high‑resolution export options.

NodeXL – an Excel add‑in that simplifies data import from social media platforms and provides built‑in metrics. NodeXL is well suited for teaching and for analysts comfortable with spreadsheet workflows.

UCINET – a comprehensive suite for social network analysis, offering advanced statistical tests (e.g., exponential random graph models) and matrix operations. UCINET pairs with NetDraw for visualisation.

Pajek – designed for handling extremely large networks (millions of nodes) with efficient memory management. Pajek includes specialised algorithms for blockmodeling and hierarchical clustering.

Cytoscape – originally developed for biological networks, Cytoscape has an extensible architecture that also accommodates social media data, and its CyREST interface allows it to be scripted from R or Python.

R – igraph package – provides a programmable environment for network creation, manipulation, and statistical analysis. Through the ggraph and tidygraph packages, igraph objects can be plotted with a ggplot2‑style grammar and handled in tidy‑style data pipelines.

Python – NetworkX – a flexible library for constructing and analysing graphs, with strong compatibility with pandas and matplotlib. For large‑scale visualisation, NetworkX can export to Graphviz or to WebGL‑based tools like sigma.js.

Challenges – in network analysis of social media data are numerous and must be addressed systematically:

Data quality – API responses can be incomplete due to rate limits, privacy settings, or platform restrictions. Missing edges distort centrality measures; researchers should document data collection windows and consider imputation or sensitivity analysis.

Sampling bias – often arises when only a subset of users is collected (e.g., using keyword filters). This can over‑represent highly active users and under‑represent peripheral participants, inflating perceived influence.

Scale – social media networks can involve millions of nodes and billions of edges. Processing such volumes requires parallel computation, graph databases (e.g., Neo4j), or sampling techniques such as snowball sampling, random node sampling, or edge sparsification.

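Computational complexity – many algorithms become infeasible on large graphs. A naive all‑pairs computation of betweenness centrality takes O(n^3) time; Brandes’ algorithm is exact and faster, but still needs O(nm) time on unweighted graphs, where n is the number of nodes and m the number of edges. Approximations, such as estimating betweenness from a random sample of source nodes, and fast heuristics, such as Louvain for community detection, alleviate this burden.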

Temporal resolution – choosing the appropriate time window influences dynamic analysis. Too coarse a window may mask rapid diffusion events; too fine a window can produce fragmented networks with many isolated nodes.

Privacy and ethics – even when data are publicly available, aggregating and visualising them can reveal sensitive patterns. Researchers must follow platform terms of service, obtain ethical clearance where required, and consider anonymisation techniques (e.g., hashing user IDs, aggregating at the community level).

Interpretation pitfalls – include conflating correlation with causation (e.g., assuming high betweenness causes influence), over‑reliance on visual patterns without statistical validation, and ignoring the role of offline contexts that shape online behaviour.

Practical application examples – illustrate how the terminology is employed in real research:

Example 1 – Political mobilisation on Twitter. Researchers collected all tweets containing the hashtag #Vote2024 during a one‑week period. Nodes were user accounts; edges were directed retweets. Degree centrality identified the top amplifiers, while betweenness highlighted users bridging partisan clusters. Community detection revealed three distinct political camps. Visualisation employed a force‑directed layout with node colour encoding party affiliation and node size reflecting retweet volume. Temporal slices before and after a televised debate showed a surge in cross‑camp edges, indicating momentary dialogue.

Example 2 – Brand sentiment network on Instagram. By scraping public comments on brand‑related posts, a bipartite graph was built linking users to sentiment tags (e.g., #love, #disappointed). Projecting onto the sentiment nodes produced a co‑occurrence network where edge weight indicated the number of users expressing both sentiments. Modularity analysis uncovered clusters of mixed sentiment, suggesting nuanced consumer attitudes. Edge thickness visualised the strength of co‑sentiment ties, while node colour represented sentiment polarity.

Example 3 – Information diffusion during a natural disaster on Facebook. Using the Graph API, researchers accessed public posts from a regional community page. Nodes represented posts; edges represented shares. A temporal network model captured the cascade depth. Betweenness centrality pinpointed the original source of the most widely shared safety information. GIS integration mapped the location of users who reshared the post, revealing hotspots of offline assistance requests.

Example 4 – Academic collaboration on ResearchGate. A multiplex network was constructed with two layers: co‑authorship (undirected, weighted by number of joint papers) and mentorship (directed, derived from advisor‑advisee relationships). Eigenvector centrality on the mentorship layer identified senior scholars, while community detection on the co‑authorship layer revealed interdisciplinary clusters. Visualising both layers simultaneously required a multiplex‑aware layout that preserved node positions across layers while differentiating edge types by colour and line style.

Data preprocessing steps – are critical to ensure the reliability of the subsequent analysis:

1. Data cleaning – remove duplicate records, resolve inconsistent user identifiers (e.g., case‑sensitive usernames), and filter out bots using activity thresholds or machine‑learning classifiers.

2. Edge filtering – apply weight thresholds to discard trivial interactions that may add noise (e.g., single‑mention edges in a high‑volume dataset). Researchers must justify the threshold choice to avoid arbitrarily pruning meaningful ties.

3. Attribute enrichment – augment nodes with external data such as demographic information, location coordinates, or sentiment scores derived from natural language processing. Enriched attributes enable richer visual encodings and multivariate analysis.

4. Normalization – scale quantitative attributes (e.g., follower count) to a common range before mapping to visual properties. Log‑transformation is common for skewed distributions typical of social media metrics.

5. Sampling – when the full network is infeasible to process, employ stratified sampling to preserve community structure. Techniques such as forest‑fire sampling retain the network’s degree distribution more faithfully than random node sampling.
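Two of the steps above, edge filtering and normalisation, can be sketched in plain Python. The weight threshold, the follower counts, and the 5 to 50 size range are illustrative assumptions rather than recommended values.

```python
import math

def filter_edges(weighted_edges, min_weight):
    """Keep only (source, target, weight) edges at or above the threshold."""
    return [(u, v, w) for u, v, w in weighted_edges if w >= min_weight]

def log_scale(values, low=5.0, high=50.0):
    """Log-transform skewed counts, then rescale them to [low, high]."""
    logs = {k: math.log10(v + 1) for k, v in values.items()}
    smallest, largest = min(logs.values()), max(logs.values())
    span = largest - smallest
    return {k: low if span == 0 else low + (x - smallest) / span * (high - low)
            for k, x in logs.items()}

# Hypothetical mention counts and follower counts.
kept = filter_edges([("a", "b", 1), ("a", "c", 5)], min_weight=2)
sizes = log_scale({"a": 9, "b": 99, "c": 999})
```

Adding 1 before taking the logarithm keeps accounts with zero followers defined, and the log step stops one very large account from shrinking every other node to the minimum size.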

Typical analytical workflow – can be summarised in sequential stages:

Stage 1 – Define research question and select appropriate network representation (e.g., directed retweet network vs. bipartite hashtag‑user network).

Stage 2 – Acquire data via API calls, web scraping, or data dumps, ensuring compliance with platform policies.

Stage 3 – Store raw data in a structured format (e.g., CSV for edge list, JSON for node attributes) and back up the dataset.

Stage 4 – Preprocess the data: clean, deduplicate, enrich, and filter edges according to analytical goals.

Stage 5 – Construct the graph using a library (e.g., igraph) or import into a visual tool (e.g., Gephi). Verify that the graph’s properties (node count, edge count, directedness) align with expectations.

Stage 6 – Compute descriptive statistics (density, average degree, degree distribution) to characterise the overall network.

Stage 7 – Calculate centrality measures relevant to the research question (degree, betweenness, eigenvector). Identify top‑ranking nodes for further qualitative investigation.

Stage 8 – Perform community detection and assess modularity. Examine the composition of each community using node attributes to interpret the social meaning.

Stage 9 – If the study is temporal, segment the data into intervals, repeat Stages 6‑8 for each slice, and compare metrics over time to detect trends or shocks.

Stage 10 – Design visualisations: choose layout, map attributes to visual channels, apply filters to reduce clutter, and generate high‑resolution exports for reports.

Stage 11 – Conduct statistical validation (e.g., permutation tests for assortativity, exponential random graph models for tie formation) to substantiate observed patterns.

Stage 12 – Document methodology, limitations, and ethical considerations. Prepare a reproducible script or workflow repository (e.g., on GitHub) to enable peer verification.

Interpretation guidance – helps avoid misreading visual artefacts:

- Recognise that node size exaggeration can create the illusion of hub dominance; always cross‑check visual impressions with numeric centrality scores.

- Be cautious when inferring causality from betweenness; a node may appear central simply because it resides in a dense subgraph rather than because it actively brokers information.

- When community colours appear to overlap, verify that the underlying modularity is statistically significant; random graphs often produce apparent clusters that are artefacts of visual layout.

- In dynamic visualisations, animate changes slowly enough for viewers to track node movements; abrupt jumps may suggest network instability that is actually a product of layout re‑initialisation.

Advanced topics – expand the foundational vocabulary into specialised domains:

Exponential random graph models (ERGMs) – statistical models that estimate the probability of tie formation based on node attributes and structural tendencies (e.g., reciprocity, transitivity). ERGMs provide a formal test of hypothesised mechanisms driving network structure.

Stochastic block models (SBMs) – generative models that partition nodes into blocks with distinct intra‑ and inter‑block connection probabilities. SBMs are useful for uncovering latent community structures that may not be captured by modularity‑based methods.

Network motifs – recurring small sub‑graph patterns (e.g., triads, feed‑forward loops) that can indicate functional building blocks. Motif analysis on communication networks can reveal typical interaction sequences, such as question‑answer‑feedback loops.

Sentiment‑augmented networks – integrate textual sentiment scores as edge weights, enabling the study of affective flow. For instance, a weighted retweet network where edge weight equals the average sentiment of retweeted messages can highlight the propagation of positive versus negative narratives.

Multilayer networks – extend multiplex concepts by allowing inter‑layer edges that connect nodes across layers (e.g., a user’s Twitter account linked to their Instagram account). Multilayer analysis can uncover cross‑platform influence patterns.

Graph embedding – machine‑learning techniques that map nodes to low‑dimensional vector spaces while preserving structural similarity. Embeddings such as node2vec or DeepWalk support downstream tasks like node classification, link prediction, and anomaly detection.

Link prediction – the task of estimating the likelihood of future edges based on current network structure and node attributes. In social media, link prediction can forecast emerging relationships, such as potential collaborations among influencers.

Network robustness – evaluates how the network’s connectivity degrades under node or edge removal. Simulating targeted attacks (e.g., removing high‑betweenness nodes) versus random failures informs resilience assessments, which are relevant for platform moderation strategies.

Visualization interactivity – modern web‑based tools (e.g., D3.js, sigma.js, Neo4j Bloom) allow users to hover over nodes for attribute pop‑ups, filter by degree, or dynamically re‑apply layout algorithms. Interactive visualisations support exploratory analysis and stakeholder engagement.

Ethical visualisation practices – include avoiding the depiction of personally identifiable information, providing context for visual encodings, and offering alternative text descriptions for accessibility. When publishing network diagrams, consider aggregating nodes to protect privacy while preserving analytical insight.

Case study synthesis – to illustrate the integration of terminology, consider a comprehensive research project on misinformation spread during a health crisis:

1. Data were harvested from Twitter using the academic API, focusing on tweets containing disease‑related keywords and a set of known misinformation hashtags.

2. A directed, weighted retweet network was built; edges were weighted by the number of retweets between user pairs.

3. Nodes were enriched with bot probability scores (from Botometer) and user location (derived from profile metadata).

4. Degree centrality identified prolific spreaders; betweenness centrality highlighted accounts that linked misinformation clusters to mainstream discourse.

5. Community detection using the Leiden algorithm uncovered three major clusters: health authorities, mainstream media, and misinformation propagators.

6. Modularity was high (0.62), confirming a strong community structure.

7. A temporal analysis with weekly snapshots revealed that after the release of an official health guideline, the betweenness of a certain journalist increased, indicating a bridging role that helped disseminate accurate information into the misinformation cluster.

8. Visualization employed a force‑directed layout, with node colour representing cluster membership, node size reflecting retweet volume, and edge thickness indicating retweet count. Bot accounts were highlighted in red to draw attention to automated amplification.

9. Ethical considerations included removing user handles from the final figure, aggregating nodes at the community level for public dissemination, and providing a data‑availability statement that respects platform terms.

Through this example, each of the key terms – node, edge, directed, weighted, centrality, community detection, modularity, temporal network, visualization, ethical practice – is operationalised within a real‑world research context.

By mastering this terminology, scholars and practitioners can design robust network studies, select appropriate analytical techniques, and produce clear visual narratives that illuminate the complex social dynamics of digital platforms. The precision of language not only facilitates methodological rigour but also ensures that findings are communicated effectively to interdisciplinary audiences, policy makers, and the broader public.

Key takeaways

  • In the context of social media research, the elements are typically users, posts, hashtags, or other digital artefacts, and the connections represent interactions such as follows, mentions, retweets, likes, or co‑occurrences.
  • In a Twitter network a node might be an individual account, while in a hashtag co‑occurrence network a node represents a specific tag.
  • The edge can also be weighted, where the weight quantifies the strength or frequency of the interaction, such as the number of retweets between two accounts.
  • In social media research, graphs are often constructed from API data dumps, scraped web pages, or platform‑specific export files.
  • Directed graph – a graph in which every edge has an orientation – is essential when the direction of information flow matters.
  • Undirected graph – a graph where edges have no orientation – is appropriate when the relationship is inherently reciprocal.
  • Weighted graph – a graph where each edge carries a numeric value – allows researchers to differentiate between strong and weak ties.