Sentiment Analysis and Topic Trends in Endometriosis Reddit Communities Using LLMs
Hey everyone! I am an AI Consultant who was recently diagnosed with endometriosis, so the next thing I did was immediately look for as many resources as I could get my hands on to read up on this chronic illness. Unfortunately, Endometriosis didn’t seem to be a priority for society even though 10% of women suffer from it. The funding for Endometriosis is negligible compared to many diseases which affect far fewer people.
This article seeks to introduce everyone to this disease by exploring it through the lens of Large Language Models, while also examining the evolution of Natural Language Processing over time. Additionally, we aim to answer the following questions using the available data:
1. What can we use or do day to day to improve our quality of life with Endometriosis?
2. If someone is undergoing a surgery, how should they prepare for it and what should they do for a speedy recovery?
Endometriosis is a chronic estrogen-dependent chronic condition characterized by the ectopic implantation of functional tissue lining the uterus (endometrial glands and stroma) outside of the uterine cavity. Endometriosis, a word derived from the Greek endo ‘’inside’’, metra ‘’uterus’’ and osis ‘’ disease,’’ remains to some extent vague, with the most common clinical symptoms being pelvic pain and infertility. Most frequently, endometrial tissue is found in ovaries, resulting in the formation of chocolate cysts, but it can also be found in the Fallopian tubes, uterosacral ligaments, the gastrointestinal tract, and less often in the pleura, pericardium, or the central nervous system.
-https://www.ncbi.nlm.nih.gov/books/NBK567777/
I turned towards Reddit for advice and finding people who are suffering through pain and other symptoms and what they are doing about it.
Why Reddit? Users are anonymous, giving better insights into their true day-to-day.
To analyze the Reddit posts I needed the data. I got a data dump of the r/Endo and r/endometriosis Subreddits till Dec 2022 from here. In 2023, Reddit started charging for its API so I didn’t explore the data beyond 2022.
Next, I did some data cleaning. The data had separate files for comments and for posts. I combined them using the parent_id of the submission/post and the comments. So, each comment has its own row while the posts might be repeated if they have more than one comment. This was my first time working with Reddit data so I used the comment below from a Reddit post to figure out the matching between posts and comments.
All comments have a
parent_id
field, which is a "fullname". Fullnames start with t1_ if the object is a comment and t3_ if the object is a submission. So this comment I'm making will have aparent_id
oft3_171bn9m
, which means the object it's replying to is your submission, whose id is171bn9m
. If you reply to my comment, your comment will have aparent_id
oft1_171bn9m
, because my comment has an id of171bn9m
.
In the above picture, the highlighted columns that link the post data frame and the comments data frame. The yellow line partitions two different posts. The columns “title” and “selftext” correspond to the post and “body” corresponds to the comment.
Why LLMs?
LLMs give us the ability to understand and generate natural language. In this project, we used these tools to analyze the sentiments of individuals suffering from endometriosis, identify high-level topics of discussion, and answer questions that I, as an endometriosis sufferer, am curious about. Additionally, we explored word clouds to uncover any new insights.
- Nomic embeddings model, nomic-embed-text-v1.5, for creating the embeddings
- Atlas for visualizing the clusters
- Model mixtral-8x22b-instruct for naming the clusters
- Model GPT4-o for answer generation
- NLTK for the word clouds
- UMAP for dimensionality reduction
There were 40158 unique posts in the Endo Subreddit and 23506 in the Endometriosis Subreddit. To reduce the number of data points and also to improve the quality of the results we focused on the top 25% highest scoring posts from each of the Subreddits, which gave us a total of 16,842 data points to work with. The score field is provided in the dataset. Surprisingly, among the 16,842 posts there were duplicates, posts that were made to Endo Subreddit as well as Endometriosis. Maybe not so surprising, if you think about it, if I want an answer, I would want to increase my chances of getting a response by posting on all the Subreddits related to my topic.
Why embeddings?
Embeddings are a great way to capture the information in a text while converting it into a series of numbers that can be visualized more easily.
For generating the embeddings we concatenated the post title and post text and used the Nomic embedder through Huggingface and visualized them using Atlas.
Cluster Visualization
It found 256 clusters and if we switch to the broad category there are 8 clusters but as expected they are very generic. So, I decided to focus on the 256 clusters and try to find topics that are relevant to people who might be suffering with Endometriosis. Here is a link to the visualization to explore more.
This is what the clusters look like, a bit messy but when you zoom in and start focusing on the data points it gets very interesting, like the clusters below
The blue ones are about “Bloating due to Endometriosis” while the pink ones are about “Clothing that will help with endo/bloated tummy”. I have added a few example data points in the table below
Exploring Cluster Topics
Since we had 256 clusters, I wanted to automate the task of assigning a topic to each cluster. I decided to use an LLM for that, gave it 30 examples from each cluster and asked it to assign a name. In the beginning I tried using GPT4-o but to save costs switched to mixtral-8x22b-instruct through OpenRouter- a common interface for using closed source and open source models, which performed at par for this task. The results were not that great in the beginning but a little bit of prompt engineering helped there, like “make sure the cluster topic is at least 3 words long”. Also, I used the instructor library to make sure the response was the expected JSON, returning all the required fields. We got 256 cluster topics which is a lot to list here but I plotted them on a graph for visualization, used Nomic for the embeddings and then UMAP for dimensionality reduction.
If you hover over the cluster on the left, you can see that it focused on Laparoscopy/ surgery. While, if you hover over near (10.4, 3.5), it focuses on frustration with doctors, some of them dismissing the patient’s symptoms and having a lack of awareness about this disease. Try to hover around and explore what topics have people been discussing on these subreddits to learn about Endometriosis.
Sentiment
I wasn’t sure how much diversity to expect in terms of sentiments from each of the clusters but I did want to explore this.
For each of the clusters, when I made a request to get the cluster topic by giving the model 30 example posts from the cluster I also asked it about the sentiment for the cluster from the following
- ANXIOUS
- WORRIED
- FRUSTRATED
- ANGRY
- NEUTRAL
- SAD
- HAPPY
- OTHER
Surprisingly, there were a bunch of clusters which had a “Happy” sentiment. Yes, this process is not foolproof, because I gave it 30 examples to judge the sentiment and used it to define the sentiment of the whole cluster but it did perform well enough to get an overall idea and see if any of the clusters were any different from the others sentiment wise.
Not surprisingly, the majority of posts reflected frustration. There was only one cluster in the “OTHER” category, the topic of the cluster is “Peculiar abdominal sensations and bloating”, most probably suitable for a curious sentiment. Most of the “HAPPY” clusters are either validation that all the pain they had experienced had a reason behind it after they were diagnosed with Endometriosis or some sort of solution to deal with Endometriosis: “HAPPY_First Lap Experience and Endometriosis Confirmation” , “Comforting Companionship During Endo Flare-ups”, “Patients finding relief and validation with specialized care”, “Comfortable Clothing for Endo Belly”……..
WordCloud
The final step was to see if we could find any terminologies which were interesting and more frequently used in one cluster compared to others. For this, we created word clouds for each cluster using the post title, post text and also the comments corresponding to the post belonging to a cluster.
We had to do a bit of cleanup before creating the word clouds like removing numbers, URLs, and stopwords as they were becoming prominent in the word cloud without adding any value. The other issue with word clouds is that they don’t show the context or the order of the words, so there are only limited insights you can gain from it, larger words correspond to higher frequency in the text.
Crowdsourcing Experiences
I cannot stress enough how much insight we can gain by learning from other people’s experiences, and Reddit is a treasure trove where people are really helpful and open about their experiences. Hence, I particularly looked into the data for the two problems and the potential solutions/advice.
Problem 1: What can we use or do day to day to improve our quality of life with Endometriosis?
Problem 2: If someone is undergoing a surgery, how should they prepare for it and what should they do for a speedy recovery?
I picked out the relevant clusters for each problem and tried feeding the post and comments from those clusters into a Large Language Model to get the top advice. Unfortunately, it hit the context limit for GPT4-o, so I tried Gemini-1.5-pro which kept giving a 429 error. Then I tried Claude-3.5-sonnet with 200k context length, it did process the text but kept generating irrelevant text regarding Endometriosis, nothing related to the advice I asked for. I ended up breaking down the amount of text I was inputting at a time and made multiple calls using GPT4-o.
Answer 1
- Pelvic Floor Physical Therapy: Many users have found significant relief from pelvic floor physical therapy. It can help with pain management, improve muscle function, and address issues like painful sex and bowel movements.
- Heating Pads and Electric Blankets: Heating pads are a common recommendation for managing endometriosis pain. Some users also suggest using electric blankets for more extensive coverage and relief.
- CBD and THC Products: Several users have reported that CBD oil, THC edibles, and cannabis products help manage their pain and improve their quality of life.
- Dietary Changes: Cutting out certain foods like soy, gluten, dairy, and red meat, and following an anti-inflammatory diet has helped some users manage their symptoms better. Adopting a low FODMAP diet. Keep a food diary to track what you eat and note any symptoms that follow. This can help identify specific food triggers that exacerbate endometriosis symptoms. Some users found relief from symptoms through intermittent fasting, which may help reduce inflammation and improve digestive health.
- Acupuncture, Acupressure and Herbal medicine: Many users have found relief through acupuncture and acupressure, which can help reduce pain and improve overall well-being, also specific Chinese herbs have been beneficial for some in managing pain and improving quality of life.
- Supplements: Taking supplements like magnesium, calcium, vitamin D, and B12 can help manage symptoms. Omega-3 fatty acids, turmeric, ginger are also recommended for their anti-inflammatory properties.
- TENS Units: Transcutaneous Electrical Nerve Stimulation (TENS) units have been effective for some users in managing pain, especially when used on the lower back and abdomen.
- Supportive Clothing and Accessories: Items like high-waisted leggings, compression socks, and heating pads that can be worn under clothing have been helpful for managing pain discreetly throughout the day.
- Regular Exercise: Gentle exercises like yoga, stretching, and walking can help reduce pain and improve overall well-being. Avoid high-intensity workouts that may exacerbate symptoms.
- Stress Management: Techniques such as mindfulness, meditation, and deep breathing exercises can help manage stress, which in turn can reduce pain and flare-ups.
- Medical Treatments: Medications like Orilissa, Lupron, and birth control pills can help manage symptoms. However, it’s important to discuss potential side effects with a healthcare provider.
Answer 2:
- Prepare Your Home and Meals: Clean your living space, change your sheets, and prepare easy-to-digest meals like soups, broths, and smoothies. Stock up on high-fiber foods and stool softeners to prevent constipation post-surgery.
- Comfortable Clothing: Wear loose, comfortable clothing to the hospital and have similar options ready for recovery. High-waisted or maternity underwear, loose-fitting pants, and nightgowns are recommended to avoid pressure on incisions.
- Pain Management: Stay ahead of the pain by taking prescribed medications on schedule. Have over-the-counter options like Gas-X for gas pain and stool softeners to ease bowel movements. This may include prescribed painkillers like oxycodone initially, followed by over-the-counter options like paracetamol and ibuprofen. Be prepared to manage gas pain with Gas-X and peppermint tea.
- Hydration and Nutrition: Drink plenty of water and have hydrating options like Gatorade or Pedialyte. Keep easy-to-eat, nutrient-rich foods and snacks like protein shakes, applesauce, and crackers on hand.
- Post-Surgery Essentials: Bring a small pillow for the car ride home to protect your abdomen from the seatbelt. Have a heating pad for shoulder pain caused by the gas used during surgery and ice packs for swelling.
- Personal Care Items: Stock up on pads or period underwear for post-surgery bleeding. Have hygiene wipes and a peri bottle for gentle cleaning, especially if you have a catheter.
- Entertainment and Comfort: Prepare a list of movies, TV shows, books, or hobbies to keep you entertained during recovery. Set up a comfortable recovery area with pillows, blankets, and all necessary items within reach.
- Medical Supplies: Have extra bandages, antiseptic cream, and any prescribed medications ready. Consider getting a belly band for additional support once you start moving around more.
- Support System: Arrange for someone to stay with you for the first few days post-surgery to help with daily tasks and ensure you have everything you need. If you have pets, consider boarding them or arranging for someone to help care for them.
- Post-Op Care: Follow your doctor’s instructions closely, including any restrictions on lifting or physical activity. Keep track of your symptoms and pain levels, and don’t hesitate to contact your doctor if you experience severe pain, fever, or signs of infection.
In comparison, here is the answer to problem 2 I receive from ChatGPT, yes of course you can ask ChatGPT this but when we give these models context that is data/text that they can use to answer a question it makes the response more reliable and also at least in the case includes useful tips instead of being hand wavy like the answer below.
Prompt used: Give top 10 useful advice that can help someone suffering from endometeriosis undergoing a surgery soon. Respond with ONLY the list of top 10 useful advice that will help someone before and after the surgery
- Research your surgery and ask your doctor detailed questions about the procedure.
2. Arrange for post-surgery support and help at home for at least a week.
3. Stock up on easy-to-prepare, nutritious meals and snacks.
4. Prepare a comfortable recovery area with pillows, blankets, and entertainment.
5. Follow all pre-surgery instructions, including fasting and medication guidelines.
6. Stay hydrated and maintain a balanced diet leading up to the surgery.
7. Plan for gentle physical activity and short walks post-surgery to aid recovery.
8. Monitor your pain levels and take prescribed pain medications as directed.
9. Attend all follow-up appointments and communicate any concerns to your healthcare provider.
10. Rest as much as needed and avoid heavy lifting or strenuous activities during recovery.
Takeaways
Technical:
- GenAI has brought in a new era of NLP where we don’t have to focus on statistical insights/frequency of words in a text, we can actually ask the model to look at a text, understand it, categorize it and answer questions about it.
- The word cloud technique that we used above seems very primitive to other techniques because there is no understanding of the text, just counting of words.
- What you get out depends on what you put in. We can ask ChatGPT for advice on endometriosis straight up, but the output lacks the same specificity, and it’s harder to rely on, because we don’t know what all it may be using as the source. By seeding it with the data from Reddit we can trust the output much more because we know what the source is.
Societal:
- It is a shame that in the year 2024, there is still not enough consideration about health issues affecting women. There is a severe lack of research and funding for Endometriosis, which can actually be multiple diseases which we have given a single name because we don’t know enough.
- It is important for us as a society to NOT ignore anyone’s pain. There is a tendency for people to ignore a woman’s pain and not take them seriously which also leads to an average delay of 7 years in diagnosis of Endometriosis. In some countries, doctors still look for a male companion to talk to instead of talking to the woman whose body is suffering.
- Workplaces need to know more about these chronic illnesses and understand someone can be severely fatigued and they need the day off to feel like themselves again and perform at their best. A little help goes a long way.