Tax Law and Natural Language Processing
Expert-defined terms from the Advanced Certification in AI in Tax Law (France) course at London School of Business and Administration. Free to read, free to share, paired with a professional course.
Artificial Intelligence #
The field of computer science that creates systems capable of performing tasks that normally require human intelligence. Related terms: machine learning, deep learning. Example: AI models that predict tax audit risk. Challenges: bias in training data, regulatory compliance.
Algorithmic Transparency #
The degree to which the inner workings of an algorithm are understandable to stakeholders. Related terms: explainability, black‑box. Example: Publishing the logic used to flag potentially abusive tax arrangements. Challenges: protecting proprietary code while meeting disclosure obligations.
Anti‑Abuse Rule #
Legal provisions designed to prevent the misuse of tax benefits. Related terms: general anti‑avoidance rule (GAAR), specific anti‑avoidance rule. Example: Applying the French anti‑abuse rule to a hybrid entity structure. Challenges: interpreting intent, balancing legitimate planning with deterrence.
API (Application Programming Interface) #
A set of protocols that allows software components to communicate. Related terms: REST, SOAP. Example: An API that feeds transaction data from ERP systems into a tax‑compliance NLP engine. Challenges: versioning, security, data privacy.
Article 54 of the CGI #
French tax code article governing the taxation of dividends. Related terms: Corporate Income Tax, double taxation. Example: Using NLP to extract dividend‑related clauses from shareholder agreements. Challenges: interpreting nuanced language across jurisdictions.
Attenuation Principle #
The concept that tax benefits should diminish when the underlying transaction loses economic substance. Related terms: substance over form, economic reality. Example: NLP models flagging contracts where the declared purpose diverges from the economic effect. Challenges: quantifying attenuation, data scarcity.
Automated Tax Decision Engine #
Software that autonomously determines tax outcomes based on predefined rules. Related terms: rule‑based system, decision tree. Example: An engine that classifies invoices as taxable or exempt using NLP‑extracted line items. Challenges: maintaining rule sets, handling exceptions.
Base Erosion and Profit Shifting (BEPS) #
OECD initiative to curb tax avoidance strategies that exploit gaps in tax rules. Related terms: transfer pricing, IP‑based profit shifting. Example: NLP analysis of intercompany agreements to detect BEPS‑related language. Challenges: multilingual documents, evolving guidance.
Beneficial Ownership #
The natural person who ultimately owns or controls a legal entity. Related terms: ultimate parent, control. Example: Using entity‑resolution NLP to link shareholders across filings. Challenges: incomplete data, privacy restrictions.
Big Data #
Extremely large datasets that require advanced processing techniques. Related terms: data lake, distributed computing. Example: Analyzing millions of VAT invoices with NLP to uncover compliance trends. Challenges: storage costs, data quality.
Binding Corporate Rules (BCR) #
Internal policies that allow multinational companies to transfer personal data within the group. Related terms: GDPR, data protection. Example: Ensuring that NLP‑driven tax analytics respect BCR obligations. Challenges: cross‑border data flows, auditability.
Blind Spot (NLP) #
Areas where a language model lacks coverage or accuracy. Related terms: out‑of‑vocabulary, domain drift. Example: Missing legal jargon specific to French tax statutes. Challenges: continuous model retraining, domain‑specific corpora.
Blockchain Taxation #
The application of tax rules to transactions recorded on distributed ledgers. Related terms: crypto‑assets, ledger analytics. Example: NLP parsing of smart‑contract code to determine taxable events. Challenges: anonymity, rapid regulatory change.
Business Taxonomy #
Structured classification of business activities for tax purposes. Related terms: NACE, NAF. Example: Training an NLP classifier to assign NAF codes from company descriptions. Challenges: ambiguous language, evolving industry terms.
Capital Gains Tax (CGT) #
Tax on the profit realized from the sale of assets. Related terms: realisation, exemption. Example: NLP extraction of asset disposal dates from contracts to compute CGT liability. Challenges: multi‑jurisdictional rules, timing nuances.
Chatbot Compliance Assistant #
Conversational AI that helps taxpayers understand obligations. Related terms: virtual assistant, dialogue system. Example: A French‑language chatbot that answers queries on VAT filing deadlines. Challenges: accurate legal content, language nuance.
Classification Accuracy #
Metric measuring the proportion of correct predictions in a categorisation task. Related terms: precision, recall. Example: Evaluating an NLP model that classifies documents as “tax‑relevant” vs. “non‑relevant.” Challenges: class imbalance, overfitting.
Clustering (NLP) #
Grouping similar textual items without predefined labels. Related terms: unsupervised learning, topic modeling. Example: Clustering tax rulings to discover emerging interpretative trends. Challenges: determining optimal cluster number, interpretability.
Code of Ethics (AI) #
Guidelines governing responsible AI development and deployment. Related terms: fairness, accountability. Example: Ensuring that tax‑risk models do not discriminate against small enterprises. Challenges: translating abstract principles into concrete checks.
Compliance Monitoring #
Ongoing observation of activities to ensure adherence to regulations. Related terms: audit trail, risk scoring. Example: Real‑time NLP scanning of invoices for missing tax identifiers. Challenges: false positives, system latency.
Corporate Income Tax (CIT) #
Tax levied on the profits of companies. Related terms: tax base, effective tax rate. Example: Using NLP to extract deductible expenses from financial statements. Challenges: differing fiscal years, statutory limits.
Cross‑Border Taxation #
Tax rules applicable to transactions involving multiple jurisdictions. Related terms: double taxation treaty, withholding tax. Example: NLP analysis of cross‑border service contracts to identify treaty benefits. Challenges: divergent legal language, treaty hierarchy.
Data Anonymization #
Process of removing personally identifiable information from datasets. Related terms: pseudonymisation, privacy‑by‑design. Example: Stripping taxpayer IDs from a corpus used to train a tax‑compliance model. Challenges: re‑identification risk, utility loss.
Data Governance #
Framework for managing data availability, usability, integrity, and security. Related terms: data stewardship, metadata. Example: Defining policies for storing extracted tax clause metadata. Challenges: cross‑department alignment, regulatory updates.
Data Pipeline #
Sequence of processes that move data from source to destination. Related terms: ETL, stream processing. Example: Ingesting raw PDF rulings, applying OCR, then NLP tagging. Challenges: error propagation, scaling.
Data Quality #
The condition of data based on accuracy, completeness, and consistency. Related terms: data cleansing, validation. Example: Verifying that extracted VAT rates match official tables. Challenges: manual correction costs, source heterogeneity.
Data Lake #
Central repository that stores raw data in its native format. Related terms: schema‑on‑read, big data. Example: Housing all tax‑related documents for downstream NLP analysis. Challenges: governance, searchability.
Data Minimisation #
Principle of collecting only data necessary for a specific purpose. Related terms: GDPR, purpose limitation. Example: Limiting the NLP model to extract only tax‑relevant entities, not full text. Challenges: balancing model performance with privacy.
Deep Learning #
Subset of machine learning using neural networks with many layers. Related terms: transformer, convolutional network. Example: Fine‑tuning a BERT model on French tax rulings. Challenges: computational cost, interpretability.
Denial‑of‑Service (DoS) Attack #
Malicious attempt to disrupt services by overwhelming resources. Related terms: cybersecurity, rate limiting. Example: Protecting an online tax‑question answering portal from DoS attacks. Challenges: maintaining availability while allowing legitimate traffic.
Digital Signature #
Cryptographic method to verify authenticity of electronic documents. Related terms: PKI, non‑repudiation. Example: Signing XML tax filings generated by an NLP engine. Challenges: key management, cross‑border recognition.
Document Classification #
Process of assigning categories to texts. Related terms: text categorisation, supervised learning. Example: Classifying tax notices as “assessment,” “reminder,” or “notice of appeal.” Challenges: overlapping categories, evolving terminology.
Domain Adaptation #
Technique for transferring a model trained on one domain to another. Related terms: transfer learning, fine‑tuning. Example: Adapting a general‑purpose French language model to the niche vocabulary of French tax law. Challenges: limited domain data, catastrophic forgetting.
Entity Recognition (NER) #
Identifying and classifying named entities in text. Related terms: tokenisation, span extraction. Example: Extracting VAT numbers, dates, and monetary amounts from invoices. Challenges: ambiguous entity boundaries, multilingual entities.
Entity Resolution #
Matching records that refer to the same real‑world entity. Related terms: deduplication, record linkage. Example: Linking a taxpayer’s different filings across years despite variations in naming. Challenges: false matches, scalability.
European Union VAT Directive #
Directive governing the common VAT system across EU member states. Related terms: VAT OSS, intra‑EU supply. Example: NLP parsing of cross‑border e‑commerce contracts to apply the correct VAT rate. Challenges: divergent national implementations, frequent amendments.
Exhaustive Search #
Technique that evaluates every possible solution to find the optimal one. Related terms: brute force, combinatorial optimisation. Example: Searching all possible tax‑treatment combinations for a complex restructuring. Challenges: computational infeasibility for large problem spaces.
Explainable AI (XAI) #
Methods that make AI decisions understandable to humans. Related terms: interpretability, model‑agnostic. Example: Providing a rationale for why an NLP model flagged a clause as “potentially abusive.” Challenges: balancing fidelity with simplicity, regulatory acceptance.
Fact‑Finding Interview #
Structured dialogue to gather factual information. Related terms: questionnaire, knowledge elicitation. Example: An AI‑driven interview that collects transaction details for tax compliance. Challenges: ensuring completeness, avoiding leading questions.
FATCA (Foreign Account Tax Compliance Act) #
US law requiring foreign financial institutions to report on US account holders. Related terms: CRS, information exchange. Example: NLP extraction of account holder names from bank statements for FATCA reporting. Challenges: data volume, cross‑border privacy.
Feature Engineering #
Process of creating input variables for machine learning models. Related terms: feature extraction, dimensionality reduction. Example: Deriving “sentence length” and “presence of tax keywords” as features for a classification model. Challenges: domain expertise needed, over‑fitting risk.
Fiscal Year #
The 12‑month period used for accounting and tax reporting. Related terms: financial year, tax period. Example: Aligning NLP‑derived transaction dates with the appropriate fiscal year for CIT calculations. Challenges: differing start dates across entities.
FON (Fiscal Object Number) #
French identifier for the object of a tax assessment. Related terms: tax reference, assessment number. Example: Extracting FON values from scanned assessment letters using OCR‑combined NLP. Challenges: varied document layouts, OCR errors.
Forward‑Looking Tax Risk Model #
Predictive system that anticipates future tax compliance issues. Related terms: risk scoring, scenario analysis. Example: Using NLP‑derived contract clauses to forecast audit likelihood. Challenges: data lag, uncertainty quantification.
GAAR (General Anti‑Avoidance Rule) #
Broad rule that allows tax authorities to ignore transactions lacking economic substance. Related terms: anti‑abuse rule, substance over form. Example: NLP detecting “substance‑less” language in intercompany financing agreements. Challenges: subjective interpretation, litigation risk.
Graph Neural Network (GNN) #
Neural architecture that processes data represented as graphs. Related terms: node embedding, message passing. Example: Modelling relationships between entities (taxpayer, subsidiary, jurisdiction) to detect circular structures. Challenges: data sparsity, explainability.
Hybrid Tax Model #
Combination of rule‑based and machine‑learning components. Related terms: ensemble, knowledge‑driven AI. Example: A system that first applies deterministic VAT rules, then uses NLP to resolve ambiguous cases. Challenges: integration complexity, maintenance overhead.
Information Retrieval (IR) #
Process of obtaining relevant documents from a large collection. Related terms: search engine, ranking. Example: An IR system that returns relevant tax rulings based on a user query. Challenges: query ambiguity, relevance feedback.
Inference Engine #
Component that applies logical rules to derive conclusions. Related terms: expert system, forward chaining. Example: An inference engine that combines extracted tax clauses to determine overall liability. Challenges: rule explosion, conflict resolution.
Input Normalisation #
Standardising raw data into a consistent format before processing. Related terms: pre‑processing, tokenisation. Example: Converting varied date formats in contracts to ISO‑8601 before NLP analysis. Challenges: handling exceptions, locale variations.
International Tax Treaty #
Agreement between two countries to avoid double taxation. Related terms: tax convention, Article 12. Example: NLP extraction of treaty‑benefit clauses from bilateral agreements. Challenges: treaty hierarchy, language differences.
Knowledge Graph #
Structured representation of entities and their relationships. Related terms: semantic network, ontology. Example: Building a graph linking tax concepts, statutes, and case law for query answering. Challenges: ontology alignment, data freshness.
Legislative Text Mining #
Applying NLP techniques to statutes, regulations, and official bulletins. Related terms: text analytics, semantic parsing. Example: Mining the French Tax Code to identify changes affecting digital services. Challenges: legal drafting style, amendment tracking.
Legal Ontology #
Formal specification of concepts and relationships in law. Related terms: taxonomy, semantic web. Example: Defining entities such as “taxable event,” “exemption,” and “penalty” for NLP annotation. Challenges: consensus building, extensibility.
Lexical Ambiguity #
Situation where a word has multiple meanings. Related terms: polysemy, homonymy. Example: The term “tax” could refer to a levy or to a tax‑related document. Challenges: disambiguation requires context, especially in short snippets.
Linear Regression #
Statistical method for modelling the relationship between a dependent variable and one or more independents. Related terms: predictive modeling, least squares. Example: Estimating the impact of a tax rate change on revenue using historic data. Challenges: assumptions of linearity, outlier sensitivity.
Machine Translation (MT) #
Automatic conversion of text from one language to another. Related terms: neural MT, post‑editing. Example: Translating German tax rulings into French for comparative analysis. Challenges: preserving legal nuance, domain‑specific terminology.
Metamodel #
Model that defines the structure of other models. Related terms: meta‑ontology, schema. Example: A metamodel describing how tax‑related entities should be represented in a knowledge graph. Challenges: ensuring flexibility without loss of coherence.
Model Drift #
Degradation of model performance over time due to changes in data distribution. Related terms: concept drift, retraining. Example: An NLP model trained on 2020 tax forms misclassifying 2024 forms after regulatory updates. Challenges: monitoring, timely retraining.
Monte Carlo Simulation #
Technique that uses random sampling to estimate statistical properties. Related terms: probabilistic modeling, risk analysis. Example: Simulating multiple tax‑audit scenarios to assess exposure. Challenges: computational intensity, input distribution assumptions.
Named Entity Disambiguation (NED) #
Resolving which specific entity a mention refers to. Related terms: entity linking, coreference resolution. Example: Distinguishing “SNCF” as a railway operator versus a corporate subsidiary in a tax document. Challenges: limited context, overlapping entity names.
Neural Machine Translation (NMT) #
Deep‑learning approach to MT using encoder‑decoder architectures. Related terms: transformer, attention mechanism. Example: Translating French tax notices to English for multinational compliance teams. Challenges: rare legal terms, domain adaptation.
Neural Network #
Computation model inspired by the human brain, consisting of interconnected nodes. Related terms: deep learning, backpropagation. Example: A feed‑forward network predicting the probability of a tax audit based on transaction attributes. Challenges: overfitting, interpretability.
Noise‑Robust Training #
Techniques that improve model resilience to erroneous or noisy inputs. Related terms: data augmentation, regularisation. Example: Training an NLP model on OCR‑derived tax documents that contain misspellings. Challenges: balancing robustness with precision.
OCR (Optical Character Recognition) #
Technology that converts scanned images of text into machine‑readable characters. Related terms: image preprocessing, handwriting recognition. Example: Extracting VAT numbers from scanned receipts. Challenges: low‑resolution scans, complex layouts.
Ontology Alignment #
Process of mapping concepts from different ontologies to each other. Related terms: semantic mapping, schema matching. Example: Aligning a French tax ontology with an EU‑wide fiscal ontology for data interoperability. Challenges: differing granularity, term equivalence.
Outlier Detection #
Identifying data points that deviate markedly from the norm. Related terms: anomaly detection, robust statistics. Example: Flagging unusually high VAT refunds for further audit. Challenges: defining normal thresholds, false alarms.
Parallel Corpus #
Collection of texts in two languages that are translations of each other. Related terms: bilingual data, alignment. Example: Using a French‑English parallel corpus to train a legal‑domain MT model. Challenges: scarcity of high‑quality legal parallel data.
Parsing (Syntax) #
Analyzing the grammatical structure of a sentence. Related terms: dependency parsing, constituency tree. Example: Parsing a tax clause to identify the subject, predicate, and conditional phrase. Challenges: complex legal syntax, long sentences.
Penalty Calculation Engine #
Software that computes statutory penalties based on violations. Related terms: interest accrual, late filing fee. Example: An engine that uses NLP‑extracted breach dates to calculate applicable penalties. Challenges: handling multiple penalty regimes, rounding rules.
Performance Metrics #
Quantitative measures used to evaluate model effectiveness. Related terms: F1 score, AUC. Example: Reporting precision, recall, and F1 for a tax‑document classification model. Challenges: selecting metrics aligned with business impact.
Personal Data #
Any information relating to an identified or identifiable natural person. Related terms: GDPR, data subject. Example: Redacting taxpayer names from a public dataset used for model training. Challenges: balancing utility with privacy, consent management.
Phishing Detection #
Identifying fraudulent emails that aim to steal credentials. Related terms: spam filter, malware. Example: An NLP filter that blocks tax‑related phishing attempts targeting accountants. Challenges: evolving tactics, false negatives.
Pipeline Orchestration #
Coordinating multiple processing steps in a data workflow. Related terms: Airflow, Dag. Example: Orchestrating OCR → tokenisation → NER → storage for tax document processing. Challenges: error handling, resource optimisation.
Post‑Training Quantisation #
Reducing model size by lowering numerical precision after training. Related terms: model compression, edge deployment. Example: Deploying a compact NLP model on a taxpayer‑portal server. Challenges: maintaining accuracy, hardware compatibility.
Precision #
Ratio of correctly predicted positive observations to total predicted positives. Related terms: accuracy, recall. Example: Measuring precision of an NLP model that flags “potential tax evasion” clauses. Challenges: high precision may reduce recall.
Privacy‑Preserving Machine Learning #
Techniques that protect sensitive data during model training. Related terms: federated learning, differential privacy. Example: Training a tax risk model across multiple firms without sharing raw data. Challenges: communication overhead, model convergence.
Probabilistic Topic Model #
Statistical model that discovers abstract topics in a collection of documents. Related terms: LDA, latent Dirichlet allocation. Example: Identifying dominant tax themes (e.g., “digital services,” “transfer pricing”) in rulings. Challenges: choosing number of topics, interpretability.
Prompt Engineering #
Crafting inputs to guide large language models toward desired outputs. Related terms: few‑shot learning, instruction tuning. Example: Designing prompts that elicit concise summaries of complex tax statutes from GPT‑style models. Challenges: prompt brittleness, version control.
Quantitative Risk Assessment #
Numerical evaluation of potential losses. Related terms: VaR, expected loss. Example: Estimating the monetary impact of a probable tax audit using historical data. Challenges: data availability, model assumptions.
RAG (Retrieval‑Augmented Generation) #
Architecture that combines external knowledge retrieval with generative language models. Related terms: knowledge‑grounded generation, contextualised LM. Example: A system that retrieves relevant tax articles before generating an answer to a user query. Challenges: retrieval latency, hallucination risk.
Regulatory Sandbox #
Controlled environment that allows experimentation with innovative technologies under regulator supervision. Related terms: pilot program, innovation hub. Example: Testing an AI‑driven tax compliance assistant with the French tax authority. Challenges: limited scope, compliance documentation.
Reinforcement Learning (RL) #
Learning paradigm where an agent interacts with an environment to maximise cumulative reward. Related terms: policy gradient, Q‑learning. Example: Training an RL agent to optimise audit selection to balance revenue and fairness. Challenges: reward design, exploration‑exploitation trade‑off.
Remote Sensing Data #
Information collected from satellites or aerial platforms. Related terms: geospatial analytics, GIS. Example: Using satellite imagery to verify the existence of a declared industrial site for tax purposes. Challenges: data resolution, integration with textual records.
Repository Pattern #
Software design pattern that abstracts data access. Related terms: DAO, data abstraction. Example: Implementing a repository that hides whether tax documents are stored in a relational DB or a document store. Challenges: performance tuning, consistency.
Rule‑Based System #
Engine that applies explicit IF‑THEN statements to derive conclusions. Related terms: expert system, knowledge base. Example: A rule‑based system that determines VAT applicability based on product categories. Challenges: rule maintenance, handling exceptions.
Scalable Vector Graphics (SVG) #
XML‑based format for two‑dimensional graphics. Related terms: vector image, web rendering. Example: Visualising tax‑risk heat maps as interactive SVGs on a compliance dashboard. Challenges: browser compatibility, data binding.
Semantic Search #
Retrieval technique that understands the meaning behind queries. Related terms: embedding, vector similarity. Example: Searching for “tax incentives for renewable energy” and retrieving relevant clauses even if phrasing differs. Challenges: embedding quality, latency.
Sentiment Analysis #
Process of determining the emotional tone behind words. Related terms: opinion mining, subjectivity detection. Example: Analyzing taxpayer feedback on the ease of filing to inform service improvements. Challenges: domain specificity, sarcasm detection.
Sequence‑to‑Sequence (Seq2Seq) #
Model architecture that maps an input sequence to an output sequence. Related terms: encoder‑decoder, attention. Example: Translating a French tax provision into plain English for layperson comprehension. Challenges: preserving legal precision, handling long inputs.
Service‑Oriented Architecture (SOA) #
Design paradigm where services communicate over a network. Related terms: microservices, SOAP. Example: Exposing a tax‑calculation service that can be called by external ERP systems. Challenges: versioning, security.
Shapley Value #
Concept from cooperative game theory that fairly distributes payoff among participants. Related terms: feature importance, explainability. Example: Using Shapley values to explain which document features most influenced a tax‑risk score. Challenges: computational cost, interpretability for non‑technical users.
Side‑Channel Attack #
Exploiting indirect information (e.g., timing, power consumption) to compromise a system. Related terms: cryptanalysis, hardware security. Example: Protecting an AI model that processes confidential tax data from timing attacks. Challenges: mitigation strategies, detection.
Similarity Metric #
Quantitative measure of likeness between two items. Related terms: cosine similarity, Jaccard index. Example: Measuring similarity between two tax rulings to suggest precedent. Challenges: selecting appropriate metric for legal text.
Single‑Sign‑On (SSO) #
Authentication method that permits a user to log in with one set of credentials across multiple applications. Related terms: OAuth, SAML. Example: Allowing tax consultants to access the AI compliance portal via corporate SSO. Challenges: federation trust, session management.
Soft‑Skill AI #
Artificial intelligence that assists with interpersonal aspects such as communication and negotiation. Related terms: conversational AI, empathetic bots. Example: A virtual assistant that explains tax obligations in a friendly tone. Challenges: cultural sensitivity, tone calibration.
Spacy #
Open‑source library for advanced NLP in Python. Related terms: tokeniser, pipeline. Example: Using Spacy’s French model to perform tokenisation and POS tagging on tax documents. Challenges: custom entity addition, version compatibility.
Statistical Significance #
Measure of whether an observed effect is likely due to chance. Related terms: p‑value, confidence interval. Example: Testing whether a new AI‑driven audit selection method yields a higher revenue capture rate. Challenges: sample size, multiple testing.
Stemming #
Reducing words to their root form. Related terms: lemmatisation, morphological analysis. Example: Converting “taxable,” “taxation,” and “taxes” to the stem “tax” for indexing. Challenges: over‑stemming leading to loss of meaning, language‑specific rules.
Supervised Learning #
Training models on labeled data to predict outcomes. Related terms: classification, regression. Example: Training a classifier to label documents as “tax‑relevant” using a manually annotated corpus. Challenges: label quality, class imbalance.
Tax Administration #
Government agency responsible for tax collection and enforcement. Related terms: direction générale des finances publiques (DGFiP), audit. Example: Deploying AI tools within the French tax administration to streamline audit selection. Challenges: legacy systems, change management.
Tax Avoidance #
Legal strategies used to minimise tax liabilities. Related terms: tax planning, anti‑abuse rule. Example: NLP detection of clauses that suggest aggressive tax planning. Challenges: distinguishing legitimate optimisation from abusive schemes.
Tax Credit #
Amount that reduces tax liability dollar‑for‑dollar. Related terms: tax incentive, reduction. Example: Extracting eligibility criteria for research‑and‑development credit from statutory text. Challenges: interpreting eligibility thresholds, interaction with other deductions.
Tax Evasion #
Illegal practice of not paying taxes owed. Related terms: fraud, concealment. Example: Using anomaly detection to uncover under‑reported income in electronic filings. Challenges: evidentiary standards, privacy constraints.
Tax Information Exchange Agreement (TIEA) #
Bilateral pact facilitating the exchange of tax‑relevant information. Related terms: OECD, information sharing. Example: NLP processing of exchanged reports to extract relevant taxpayer identifiers. Challenges: data format heterogeneity, confidentiality.
Tax Law Corpus #
Curated collection of statutes, regulations, case law, and commentary. Related terms: training data, knowledge base. Example: Building a French tax law corpus of 10,000 documents for language model fine‑tuning. Challenges: licensing, keeping the corpus up‑to‑date.
Taxonomy Alignment #
Mapping between different classification schemes. Related terms: ontology mapping, crosswalk. Example: Aligning the French NAF taxonomy with the EU‑wide NACE classification for reporting purposes. Challenges: divergent granularity, semantic gaps.
Temporal Reasoning #
Understanding and processing time‑related information. Related terms: time‑line extraction, event sequencing. Example: Determining the effective date of a tax amendment from legislative text. Challenges: ambiguous date expressions, relative temporal references.
Tokenisation #
Splitting text into smaller units such as words or sub‑words. Related terms: sentence segmentation, byte‑pair encoding. Example: Tokenising a French tax decree before feeding it to a transformer model. Challenges: handling compound words, punctuation.
Transfer Learning #
Reusing a pre‑trained model on a new, related task. Related terms: fine‑tuning, domain adaptation. Example: Adapting a generic French BERT model to the tax domain with a small annotated set. Challenges: catastrophic forgetting, over‑fitting.
Triple‑Store #
Database optimized for storing and retrieving RDF triples. Related terms: graph database, SPARQL. Example: Storing tax ontology statements in a triple‑store for semantic queries. Challenges: query performance, data ingestion pipelines.
Unstructured Data #
Information that does not follow a pre‑defined data model. Related terms: free text, PDF. Example: Processing scanned tax rulings that lack consistent fields. Challenges: extraction accuracy, metadata generation.
Unsupervised Learning #
Modeling techniques that find patterns without labeled outcomes. Related terms: clustering, dimensionality reduction. Example: Discovering hidden groupings of tax cases using autoencoders. Challenges: evaluation without ground truth, interpretability.
Version Control #
System for tracking changes to code and data. Related terms: Git, branching. Example: Managing iterations of the tax‑NLP model with Git to ensure reproducibility. Challenges: large binary files, data lineage.
Virtual Private Cloud (VPC) #
Isolated network environment within a public cloud. Related terms: network segmentation, security groups. Example: Deploying the AI tax compliance platform within a VPC to meet French data‑sovereignty requirements. Challenges: configuration complexity, cost monitoring.
Weighted Ensemble #
Combining multiple models where each contributes proportionally to its performance. Related terms: stacking, bagging. Example: Merging rule‑based, statistical, and deep‑learning predictions to improve tax risk scores. Challenges: correlation among models, ensemble maintenance.
Zero‑Shot Learning #
Ability of a model to perform tasks it has never seen during training. Related terms: prompting, few‑shot. Example: Asking a large language model to summarise a new tax amendment without explicit fine‑tuning. Challenges: reliability, domain specificity.