Prompts for OpenAI Codex
Identify and highlight all missing values in this dataset. Fill missing numerical values using the median of the column. For missing text values, replace with 'Not Available'. Suggest the most likely values for missing entries based on existing patterns.
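For reference, a response to the missing-value prompts above might resemble the following minimal pandas sketch. The function name `fill_missing` and the sample columns are hypothetical, and the pattern-based suggestion of likely values is not covered here.

```python
import pandas as pd

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric gaps set to the column median
    and text gaps set to 'Not Available'."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col].dtype):
            # Series.median() skips missing values by default.
            out[col] = out[col].fillna(out[col].median())
        elif pd.api.types.is_string_dtype(out[col].dtype):
            # Object and string columns are treated as text here.
            out[col] = out[col].fillna("Not Available")
    return out

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "income": [10.0, None, 30.0],
    "city": ["Oslo", None, "Rome"],
})
missing_mask = df.isna()  # True wherever a value is missing
cleaned = fill_missing(df)
```

The boolean `missing_mask` is what a "highlight" step would typically be built on, for example by passing it to a styling or reporting function.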
Find and delete duplicate rows while keeping the first occurrence. Identify near-duplicate records (e.g., customer names or emails with minor spelling differences) and suggest merging strategies. Alternatively, remove duplicate entries while preserving the most recent record.
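The two retention rules in the prompts above (keep the first occurrence, or keep the most recent record) differ only in ordering and in the `keep` argument of `drop_duplicates`. A small sketch, assuming hypothetical `customer_id` and `updated_at` columns that the prompts do not name:

```python
import pandas as pd

# Hypothetical sample data: customer 1 appears twice.
df = pd.DataFrame({
    "customer_id": [1, 2, 1],
    "email": ["old@x.com", "b@x.com", "new@x.com"],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
})

# Rule 1: keep the first occurrence of each customer.
first_kept = df.drop_duplicates(subset="customer_id", keep="first")

# Rule 2: keep the most recent record per customer.
# Sorting by timestamp makes "last" mean "most recent".
latest_kept = (
    df.sort_values("updated_at")
      .drop_duplicates(subset="customer_id", keep="last")
      .sort_index()
)
```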
Detect outliers in the 'income' column using the Z-score method. Replace values whose absolute Z-score is greater than 3 with the column median. Ensure all values in the 'age' column are non-negative and remove any invalid entries.
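A sketch of what the outlier prompts above could produce, assuming the Z-score is computed with the population standard deviation and that "invalid" ages means negative or missing ones. The function name and sample values are illustrative.

```python
import pandas as pd

def clean_income_and_age(df: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Replace income outliers (|Z| > threshold) with the median,
    then drop rows whose age is negative or missing."""
    out = df.copy()
    income = out["income"]
    # Population standard deviation (ddof=0); use ddof=1 for the sample version.
    z = (income - income.mean()) / income.std(ddof=0)
    out.loc[z.abs() > threshold, "income"] = income.median()
    # The comparison is False for NaN, so missing ages are dropped as well.
    return out[out["age"] >= 0].reset_index(drop=True)

# Hypothetical sample: one extreme income and one negative age.
df = pd.DataFrame({
    "income": [50.0] * 19 + [1000.0],
    "age": [30] * 18 + [-5, 40],
})
result = clean_income_and_age(df)
```

Note that with the population standard deviation, a dataset of fewer than eleven rows cannot contain a value with an absolute Z-score above 3, so this rule only has an effect on reasonably sized columns.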
Convert the 'gender' column to a categorical type. Apply one-hot encoding to the 'location' column. Normalize 'feature1' and 'feature2' using standard scaling.
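The encoding and scaling prompts above map onto a few pandas calls. The sketch below performs standard scaling by hand with the population standard deviation, which is the convention scikit-learn's StandardScaler uses; the function name and sample data are hypothetical.

```python
import pandas as pd

def encode_and_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Categorical 'gender', one-hot 'location', standardized features."""
    out = df.copy()
    out["gender"] = out["gender"].astype("category")
    # One indicator column per distinct location; the original column is dropped.
    out = pd.get_dummies(out, columns=["location"], prefix="location")
    for col in ["feature1", "feature2"]:
        # Zero mean, unit variance, using the population std (ddof=0).
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return out

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "location": ["north", "south", "north"],
    "feature1": [1.0, 2.0, 3.0],
    "feature2": [10.0, 20.0, 30.0],
})
encoded = encode_and_scale(df)
```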
I have JSON-formatted survey data with feedback from different technology courses (Python, SQL, R, etc.). Help me identify significant differences in sentiment, strengths, and weaknesses across these course types. Focus your analysis on: 1. Which technology has the highest overall satisfaction and why? 2. Are there common weaknesses that appear across multiple technologies? 3. Do completion rates correlate with overall ratings? 4. What are the unique strengths of each technology course? Provide your analysis in a structured format with headings for each question, and include specific evidence from the data to support your findings.
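Questions 1 and 3 in the prompt above have a numeric core that can be checked before asking for a narrative analysis. A rough sketch, assuming each survey record carries hypothetical `course`, `rating`, and `completion_rate` fields (the prompt does not specify the schema) and using made-up sample records:

```python
import json
import pandas as pd

# Made-up records in an assumed schema, for illustration only.
SURVEY_JSON = """
[
  {"course": "Python", "rating": 5, "completion_rate": 0.9},
  {"course": "Python", "rating": 4, "completion_rate": 0.8},
  {"course": "SQL",    "rating": 3, "completion_rate": 0.6},
  {"course": "SQL",    "rating": 4, "completion_rate": 0.7},
  {"course": "R",      "rating": 2, "completion_rate": 0.5}
]
"""

responses = pd.DataFrame(json.loads(SURVEY_JSON))

# Question 1: average satisfaction per technology, highest first.
mean_rating = (
    responses.groupby("course")["rating"].mean().sort_values(ascending=False)
)

# Question 3: Pearson correlation between completion rate and rating.
correlation = responses["completion_rate"].corr(responses["rating"])
```

The free-text questions (strengths, weaknesses, sentiment) are where the language model adds value; the figures above only give it evidence to cite.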
Identify and highlight all missing values in this dataset. Then, fill missing numerical values using the median of the column, and replace all missing text values with 'Not Available'. If possible, suggest the most likely values for missing entries based on existing patterns.
Find and delete duplicate rows in this dataset while keeping the first occurrence. Also, identify near-duplicate customer records and suggest merging strategies. Highlight rows where the same email appears more than once, and find duplicate product names with slight spelling variations.
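One possible shape for the duplicate checks described above, using pandas for exact matches and the standard library's difflib for spelling variations. The column names, the helper `near_duplicate_pairs`, and the 0.8 similarity cutoff are assumptions chosen for illustration.

```python
import difflib
import pandas as pd

# Hypothetical sample data: the last row repeats the first one exactly.
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "a@x.com"],
    "product": ["Widget", "Gadget", "Widgit", "Widget"],
})

# Exact duplicate rows, keeping the first occurrence.
deduped = df.drop_duplicates(keep="first")

# Every row whose email appears more than once.
repeated_emails = df[df["email"].duplicated(keep=False)]

def near_duplicate_pairs(names, cutoff=0.8):
    """Return pairs of distinct names whose similarity ratio meets the cutoff."""
    unique = sorted(set(names))
    pairs = []
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            ratio = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if ratio >= cutoff:
                pairs.append((a, b))
    return pairs

candidates = near_duplicate_pairs(df["product"])
```

The pairwise comparison is quadratic in the number of distinct names, so for large catalogs some form of blocking (for example, comparing only names that share a first letter) would likely be needed.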
I have a CSV file [insert CSV file] containing sales transaction data from multiple store locations (columns: TransactionID, StoreID, SaleDate, ProductID, Quantity, and Price). Some rows have missing or incorrect StoreIDs, and some of the prices look off. Please outline a step-by-step approach to identify and handle these discrepancies, and provide sample Python code for cleaning tasks like removing or imputing missing StoreIDs and fixing price outliers.
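A cleaning routine of the kind this prompt asks for might look like the sketch below. It rests on several assumptions not in the prompt: a reference set of valid StoreIDs exists, invalid StoreIDs are relabeled 'UNKNOWN' rather than dropped, and a price counts as an outlier when it is non-positive or differs from the per-product median by more than a factor of five. The inline CSV is made-up sample data.

```python
import io
import pandas as pd

# Made-up sample standing in for the real CSV file.
CSV_TEXT = """TransactionID,StoreID,SaleDate,ProductID,Quantity,Price
1,S01,2024-01-05,P1,2,9.99
2,,2024-01-05,P1,1,9.99
3,S99,2024-01-06,P1,1,999.00
4,S02,2024-01-06,P2,3,-4.50
5,S02,2024-01-07,P2,1,4.50
6,S01,2024-01-07,P2,2,4.50
"""
VALID_STORES = {"S01", "S02"}  # hypothetical reference list

def clean_sales(df: pd.DataFrame, valid_stores, max_ratio: float = 5.0) -> pd.DataFrame:
    out = df.copy()
    # Missing StoreIDs are not in the valid set, so they are flagged too.
    bad_store = ~out["StoreID"].isin(valid_stores)
    out.loc[bad_store, "StoreID"] = "UNKNOWN"
    # Per-product median, computed over positive prices only.
    positive = out["Price"].where(out["Price"] > 0)
    median = positive.groupby(out["ProductID"]).transform("median")
    bad_price = (
        (out["Price"] <= 0)
        | (out["Price"] > median * max_ratio)
        | (out["Price"] < median / max_ratio)
    )
    # Replace suspicious prices with the product's median price.
    out.loc[bad_price, "Price"] = median[bad_price]
    return out

sales = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["SaleDate"])
cleaned = clean_sales(sales, VALID_STORES)
```

Dropping the flagged rows instead of relabeling them is a one-line change (`out = out[~bad_store]`); which is appropriate depends on whether per-store totals or overall totals matter more.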
Given a dataset, write Python code using pandas and scikit-learn to: 1. Fill missing numerical values with the column median. 2. Replace outliers in the 'income' column (absolute Z-score > 3) with the median. 3. Normalize 'feature1' and 'feature2' using StandardScaler. 4. Convert the 'gender' column to categorical type. 5. One-hot encode the 'location' column. 6. Remove rows where 'age' is negative.