ChatGPT Makes Data Analysis Easier

Introduction

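Tired of digging through mountains of analytics data without arriving at any real answers? ChatGPT aims to change that. With its natural language processing capabilities, ChatGPT can help surface hidden patterns and trends in your data that you might never have thought to look for.
In this blog post, we explore how ChatGPT can transform the way you work with data through exploratory data analysis, and how it may change the way you think about data analysis.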

Prompts Are the Key to ChatGPT

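I have come to realize that prompts are critical to getting the most out of ChatGPT. Although ChatGPT can handle a wide range of tasks, we need to give it correct and detailed prompts to use it well. Without a precise prompt, you will not get the results you expect.
I am running this experiment to see whether ChatGPT can really extract meaning from a dataset. I already know that ChatGPT can give me code snippets for specific tasks.
For example, given the prompt "Provide me with a code snippet to check for outliers", ChatGPT returned a snippet that checks for and identifies outliers. But can ChatGPT help me answer questions directly, such as which columns in the dataset contain outliers, or what the correlation coefficient is between the target variable and the features?
To answer questions like these, ChatGPT has to analyze specific columns in the dataset and carry out the math needed to arrive at an answer.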
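
For reference, here is a minimal sketch of what those two questions look like when computed locally with pandas. It assumes the common 1.5 x IQR rule for outliers and Pearson correlation; the helper names and the toy data are hypothetical and are not taken from any ChatGPT response.

```python
import pandas as pd


def columns_with_outliers(df: pd.DataFrame, k: float = 1.5) -> list[str]:
    """Return numeric columns with at least one value outside the IQR fences."""
    flagged = []
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        if ((values < lower) | (values > upper)).any():
            flagged.append(col)
    return flagged


def target_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric feature with the target column."""
    numeric = df.select_dtypes(include="number")
    return numeric.corr()[target].drop(target)


# Hypothetical toy data, used only to exercise the helpers.
toy = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 100],   # 100 lies far outside the IQR fences
    "feature_b": [5, 4, 3, 2, 1],
    "target": [0, 0, 1, 1, 1],
})

outlier_cols = columns_with_outliers(toy)
correlations = target_correlations(toy, "target")
print(outlier_cols)
print(correlations)
```

Having a local baseline like this makes it possible to judge whether the answers ChatGPT gives in plain language are actually correct.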

Exploratory Data Analysis (EDA) with ChatGPT

Let's try a few prompts to run EDA with ChatGPT:
Prompt 1:
I want you to act as a data scientist and analyze the dataset. Provide me with the exact and definitive answer for each question. Do not provide me with the code snippets for the questions. The dataset is provided below. Consider the given dataset for analysis. The first row of the dataset contains the header.
Response:

Sure, I can help you analyze the dataset. 
Please provide me with the dataset so that I can begin the analysis.

Prompt 2:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q

Response:

Thank you for providing the dataset.   
I will analyze it based on the given information. Here are the answers to your questions:
1. How many columns are there in the dataset?
   Answer: There are 12 columns in the dataset.
2. How many rows are there in the dataset?
   Answer: There are 6 rows in the dataset, excluding the header.
3. What are the names of the columns in the dataset?
   Answer: The names of the columns in the dataset are:
   - PassengerId
   - Survived
   - Pclass
   - Name
   - Sex
   - Age
   - SibSp
   - Parch
   - Ticket
   - Fare
   - Cabin
   - Embarked
4. What is the data type of the 'Survived' column?
   Answer: The data type of the 'Survived' column is numeric (integer).
5. What is the minimum and maximum age of the passengers?
   Answer: The minimum age of the passengers is 22, and the maximum age is 38.
6. How many unique values are there in the 'Embarked' column?
   Answer: There are 3 unique values in the 'Embarked' column.
7. What is the most common value in the 'Pclass' column?
   Answer: The most common value in the 'Pclass' column is 3.
8. How many passengers survived (Survived = 1) according to the dataset?
   Answer: According to the dataset, 3 passengers survived.
Please let me know if you have any more questions or if there's anything else I can assist you with!
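
The answers above can be cross-checked locally. The following is a minimal sketch, assuming pandas is installed and that the six sample rows from Prompt 2 are pasted into a string; the variable names are my own and are only illustrative.

```python
import io

import pandas as pd

# The six sample rows from Prompt 2, with plain double quotes around names.
CSV_TEXT = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

n_rows, n_cols = df.shape                                   # questions 1 and 2
survived_is_int = pd.api.types.is_integer_dtype(df["Survived"])  # question 4
age_min, age_max = df["Age"].min(), df["Age"].max()         # question 5 (NaN is skipped)
embarked_unique = df["Embarked"].nunique()                  # question 6
pclass_mode = df["Pclass"].mode().iloc[0]                   # question 7
n_survived = int((df["Survived"] == 1).sum())               # question 8

print(n_rows, n_cols, survived_is_int, age_min, age_max,
      embarked_unique, pclass_mode, n_survived)
```

On this sample the computed values agree with the answers in the response, which suggests ChatGPT handled these simple counting questions correctly for a dataset this small.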

Prompt 3:
How many rows and columns are present in the dataset?
Response:
[Image: 问答.png]

Prompt 4:
List down the numerical and categorical columns
Response:

Based on the dataset you provided, here are the numerical and categorical columns:
Numerical columns:
- PassengerId
- Survived
- Pclass
- Age
- SibSp
- Parch
- Fare
Categorical columns:
- Name
- Sex
- Ticket
- Cabin
- Embarked
Please note that the "Survived" column, although represented by numbers (0 and 1), is considered as a categorical column since it represents a binary category (0 = Not survived, 1 = Survived).
Let me know if there's anything else I can assist you with!
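
The same split can be reproduced locally. This is a minimal sketch, assuming pandas and the six sample rows from Prompt 2; it relies on the dtypes that pandas infers, so a column such as Survived lands in the numerical group even though it encodes a category.

```python
import io

import pandas as pd

# The six sample rows from Prompt 2.
CSV_TEXT = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Split columns by inferred dtype: numbers versus everything else.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

A dtype-based split is only a starting point. Deciding that Survived or Pclass should be treated as categorical is a modeling judgment that the dtype alone cannot make.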

Prompt 5:
Check for NANs present in the dataset? If yes, print no. of nans in each column.
Response:
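
The same check can be run locally as a baseline. This is a minimal sketch, assuming pandas and the six sample rows from Prompt 2; it is not ChatGPT's output.

```python
import io

import pandas as pd

# The six sample rows from Prompt 2.
CSV_TEXT = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Count missing values per column; empty CSV fields are read as NaN.
nan_counts = df.isna().sum()

# Show only the columns that actually contain missing values.
print(nan_counts[nan_counts > 0])
```

In the sample, the only empty fields are the Age of passenger 6 and the Cabin of passengers 1, 3, 5, and 6.
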
Prompt 6:
Determine the columns that follow the skewed distribution and name them.
Response:

To determine the columns that follow a skewed distribution,  
we can calculate the skewness measure for each numerical column in the dataset.   
Skewness measures the asymmetry of a distribution.   
If the skewness value is greater than 0, it indicates a right-skewed distribution (tail on the right),   
while a skewness value less than 0 indicates a left-skewed distribution (tail on the left).
Here are the columns that follow a skewed distribution based on the dataset you provided:
- Age: Skewed distribution (right-skewed)
- Fare: Skewed distribution (right-skewed)
Please note that the skewness assessment is based on the given dataset. If you have additional columns or a different dataset, the results may vary.
If you have any further questions or if there's anything else I can assist you with, please let me know!
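
The skewness measure described in the response can be computed directly. This is a minimal sketch, assuming pandas and the six sample rows from Prompt 2; `DataFrame.skew` returns the bias-adjusted sample skewness and skips missing values.

```python
import io

import pandas as pd

# The six sample rows from Prompt 2.
CSV_TEXT = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Skewness per numeric column: > 0 means a longer right tail, < 0 a longer left tail.
skewness = df.select_dtypes(include="number").skew()

print(skewness[["Age", "Fare"]])
```

Comparing the sign of each value with the labels in the response is a quick way to sanity-check the model's claim. With so few rows the estimates are noisy, so the check is worth repeating on the full dataset.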

Prompt 7:
Generate meaningful insights about the dataset.
Response: