Airline Twitter Sentiment Analysis
Author: Pratik Savla
Airline Twitter Sentiment Analysis: Mining Passenger Feedback Using Hive on Cloudera
Keywords: Sentiment Analysis, Hive, Cloudera, Big Data, Twitter, Airline Feedback, Data Analytics, NUS, HPE
As part of the Global Academic Internship Program at the National University of Singapore (NUS), in collaboration with Hewlett Packard Enterprise (HPE), we undertook a real-world data analytics project: sentiment analysis of tweets that passengers directed at airlines. The aim was to gain actionable insights into passenger sentiment and feedback toward major U.S. airlines.
We used Apache Hive on the Cloudera big data platform to process and analyze the dataset.
Project Objective
Airlines receive constant feedback on Twitter, ranging from praise and complaints to questions and suggestions. Reading and categorizing that volume of tweets by hand is not feasible. Our goal was to:
- Classify passenger sentiments (positive, negative, neutral).
- Identify common issues faced by passengers.
- Understand how sentiment varies across different airlines.
Tools and Technologies
- Hive: Used for querying and analyzing large volumes of structured data.
- Cloudera: Enterprise data platform to manage and process big data at scale.
- Twitter US Airline Dataset: Public dataset containing thousands of tweets directed at U.S. airlines, each labeled with sentiment and reason.
Data Pipeline
1. Data Ingestion: The dataset was imported into Cloudera’s HDFS environment.
2. Schema Creation: We defined tables in Hive to structure the tweet data, including fields like tweet ID, airline name, sentiment label, text, and reasons.
3. Querying with HiveQL: We used HiveQL to compute sentiment breakdowns, rank the top complaint reasons, and compare airlines.
4. Visualization: Query results were exported and charted outside Hive to make patterns and trends visible.
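The querying step can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the project's actual code: Python's built-in sqlite3 module stands in for Hive so the example runs without a cluster, and the table name `tweets`, the column names, and the sample rows are all hypothetical. The two SELECT statements use only plain SQL (COUNT, GROUP BY, ORDER BY), which HiveQL should accept as well.

```python
import sqlite3

# Hypothetical sample rows: (tweet_id, airline, sentiment, negative_reason, text).
# Airline names and tweets are made up for illustration.
SAMPLE_TWEETS = [
    (1, "AirA", "negative", "Late Flight", "@AirA two hours late again"),
    (2, "AirA", "negative", "Lost Luggage", "@AirA where is my bag?"),
    (3, "AirA", "positive", None, "@AirA great crew today"),
    (4, "AirB", "neutral", None, "@AirB what time does check-in open?"),
    (5, "AirB", "negative", "Late Flight", "@AirB delayed on the tarmac"),
]


def load_tweets(rows):
    """Create an in-memory table (a stand-in for the Hive table) and fill it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tweets ("
        " tweet_id INTEGER, airline TEXT, sentiment TEXT,"
        " negative_reason TEXT, text TEXT)"
    )
    conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?, ?)", rows)
    return conn


# Sentiment breakdown: number of tweets per sentiment label.
SENTIMENT_BREAKDOWN = """
SELECT sentiment, COUNT(*) AS tweet_count
FROM tweets
GROUP BY sentiment
ORDER BY tweet_count DESC, sentiment
"""

# Top complaint reasons: only negative tweets that carry a reason.
TOP_COMPLAINTS = """
SELECT negative_reason, COUNT(*) AS tweet_count
FROM tweets
WHERE sentiment = 'negative' AND negative_reason IS NOT NULL
GROUP BY negative_reason
ORDER BY tweet_count DESC, negative_reason
"""

conn = load_tweets(SAMPLE_TWEETS)
breakdown = conn.execute(SENTIMENT_BREAKDOWN).fetchall()
complaints = conn.execute(TOP_COMPLAINTS).fetchall()
print(breakdown)
print(complaints)
```

On a cluster, the same SELECT statements would run against a Hive table defined over the files in HDFS instead of an in-memory SQLite table.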
Key Findings
- Sentiment Distribution: The majority of tweets were negative, followed by neutral, with only a small percentage being positive.
- Top Complaints: Common issues included flight delays, customer service problems, lost baggage, and booking troubles.
- Airline Comparison:
  - Some airlines had a relatively balanced sentiment spread.
  - Others showed disproportionately high negative sentiment, highlighting potential areas for service improvement.
- Time of Day Analysis: Tweets posted during typical flight hours (early morning to late night) skewed more negative than tweets posted outside those hours.
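The airline comparison and the time-of-day analysis both reduce to one calculation: the share of negative tweets within each group. The sketch below illustrates that calculation in plain Python. The row layout, the timestamp format, the airline names, and the helper `negative_share` are assumptions for illustration, and the sample rows are invented, so the numbers say nothing about the real results.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: (airline, sentiment, created_at).
ROWS = [
    ("AirA", "negative", "2015-02-17 06:15:00"),
    ("AirA", "negative", "2015-02-17 06:40:00"),
    ("AirA", "positive", "2015-02-17 14:05:00"),
    ("AirB", "neutral", "2015-02-17 14:30:00"),
    ("AirB", "negative", "2015-02-17 22:10:00"),
    ("AirB", "positive", "2015-02-17 22:45:00"),
]


def negative_share(rows, key):
    """Return {group: percentage of negative tweets}, grouping rows by key(row)."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for row in rows:
        group = key(row)
        totals[group] += 1
        if row[1] == "negative":
            negatives[group] += 1
    return {g: round(100.0 * negatives[g] / totals[g], 1) for g in sorted(totals)}


def hour_of(row):
    """Extract the hour (0-23) from the assumed 'YYYY-MM-DD HH:MM:SS' timestamp."""
    return datetime.strptime(row[2], "%Y-%m-%d %H:%M:%S").hour


by_airline = negative_share(ROWS, key=lambda row: row[0])
by_hour = negative_share(ROWS, key=hour_of)
print(by_airline)
print(by_hour)
```

Comparing shares instead of raw counts keeps airlines, or hours, with very different tweet volumes on the same scale.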
Challenges Faced
- Data Cleaning: Tweets are unstructured and noisy; preprocessing was essential before Hive analysis.
- Scalability: Handling large datasets required optimizing Hive queries and using partitioning for faster results.
- Subjectivity of Sentiment: Although every tweet came with a sentiment label, capturing the nuanced tone of some tweets would have required deeper linguistic analysis.
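As an illustration of the data cleaning challenge, here is a minimal preprocessing sketch in Python. The specific rules (decode HTML entities, drop URLs and @mentions, collapse whitespace) and the function name `clean_tweet` are assumptions; the actual cleaning steps used in the project are not described here. Collapsing embedded newlines matters in particular when tweets are stored as delimited text, where a line break inside a tweet would otherwise be read as the start of a new row.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")   # http:// and https:// links
MENTION_RE = re.compile(r"@\w+")       # @handles
WHITESPACE_RE = re.compile(r"\s+")     # runs of spaces, tabs, and newlines


def clean_tweet(text):
    """Return a single-line, lightly normalized version of a raw tweet."""
    text = html.unescape(text)            # e.g. "&amp;" becomes "&"
    text = URL_RE.sub("", text)           # remove links
    text = MENTION_RE.sub("", text)       # remove @mentions
    text = WHITESPACE_RE.sub(" ", text)   # collapse whitespace, including newlines
    return text.strip()


raw = "@AirA my bag is lost &amp; nobody\nanswers http://t.co/abc123"
cleaned = clean_tweet(raw)
print(cleaned)
```

Whether to remove @mentions depends on the analysis: in this kind of dataset they often name the airline, so a real pipeline might extract them into their own column before stripping them from the text.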
Conclusion
This project highlighted the value of big data analytics in the aviation industry. By analyzing social media data with Hive on Cloudera, we uncovered detailed insights into customer experience that airlines can use to adapt and improve service quality proactively.
The internship offered practical exposure to enterprise-scale data platforms, reinforced core concepts in data engineering and analytics, and emphasized the importance of real-time sentiment analysis in customer-facing industries.