Javascript required
Skip to content Skip to sidebar Skip to footer

Find Oh for the Fourth Row Solution

101 python pandas exercises are designed to challenge your logical muscle and to help internalize data manipulation with python's favorite package for data analysis. The questions are of 3 levels of difficulties with L1 being the easiest to L3 being the hardest.

101 Pandas Exercises. Photo by Chester Ho. You might also like to practice the 101 NumPy exercises, they are often used together.

Create Powerful Visualizations using Python with my FREE 9-Day-Video-Course.

1. How to import pandas and check the version?

Show Solution

            import numpy as np  # optional import pandas as pd print(pd.__version__) print(pd.show_versions(as_json=True))                      
            0.20.3 {'system': {'commit': None}, 'dependencies': {'pandas': '0.20.3', 'pytest': '3.2.1', 'pip': '9.0.1', 'setuptools': '36.5.0.post20170921', 'Cython': '0.26.1', 'numpy': '1.13.3', 'scipy': '0.19.1', 'xarray': None, 'IPython': '6.1.0', 'sphinx': '1.6.3', 'patsy': '0.4.1', 'dateutil': '2.6.1', 'pytz': '2017.2', 'blosc': None, 'bottleneck': '1.2.1', 'tables': '3.4.2', 'numexpr': '2.6.2', 'feather': None, 'matplotlib': '2.1.0', 'openpyxl': '2.4.8', 'xlrd': '1.1.0', 'xlwt': '1.2.0', 'xlsxwriter': '1.0.2', 'lxml': '4.1.0', 'bs4': '4.6.0', 'html5lib': '0.999999999', 'sqlalchemy': '1.1.13', 'pymysql': None, 'psycopg2': None, 'jinja2': '2.9.6', 's3fs': None, 'pandas_gbq': None, 'pandas_datareader': None}} None                      

2. How to create a series from a list, numpy array and dict?

Create a pandas series from each of the items below: a list, numpy and a dictionary Input

          import numpy as np mylist = list('abcedfghijklmnopqrstuvwxyz') myarr = np.arange(26) mydict = dict(zip(mylist, myarr))                  

Create Powerful Visualizations in Python: Sign up for my free 9-day-email course here.

Show Solution

            # Inputs import numpy as np mylist = list('abcedfghijklmnopqrstuvwxyz') myarr = np.arange(26) mydict = dict(zip(mylist, myarr))  # Solution ser1 = pd.Series(mylist) ser2 = pd.Series(myarr) ser3 = pd.Series(mydict) print(ser3.head())                      
            a    0 b    1 c    2 d    4 e    3 dtype: int64                      

3. How to convert the index of a series into a column of a dataframe?

Difficulty Level: L1 Convert the series ser into a dataframe with its index as another column on the dataframe. Input

          mylist = list('abcedfghijklmnopqrstuvwxyz') myarr = np.arange(26) mydict = dict(zip(mylist, myarr)) ser = pd.Series(mydict)                  

Show Solution

            # Input mylist = list('abcedfghijklmnopqrstuvwxyz') myarr = np.arange(26) mydict = dict(zip(mylist, myarr)) ser = pd.Series(mydict)  # Solution df = ser.to_frame().reset_index() print(df.head())                      
                          index  0 0     a  0 1     b  1 2     c  2 3     d  4 4     e  3                      

4. How to combine many series to form a dataframe?

Difficulty Level: L1 Combine ser1 and ser2 to form a dataframe. Input

          import numpy as np ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz')) ser2 = pd.Series(np.arange(26))                  

Show Solution

            # Input import numpy as np ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz')) ser2 = pd.Series(np.arange(26))  # Solution 1 df = pd.concat([ser1, ser2], axis=1)  # Solution 2 df = pd.DataFrame({'col1': ser1, 'col2': ser2}) print(df.head())                      
                          col1  col2 0    a     0 1    b     1 2    c     2 3    e     3 4    d     4                      

5. How to assign name to the series' index?

Difficulty Level: L1 Give a name to the series ser calling it 'alphabets'. Input

          ser = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))                  

Show Solution

            # Input ser = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))  # Solution ser.name = 'alphabets' ser.head()                      
            0    a 1    b 2    c 3    e 4    d Name: alphabets, dtype: object                      

6. How to get the items of series A not present in series B?

Difficulty Level: L2 From ser1 remove items present in ser2.

          ser1 = pd.Series([1, 2, 3, 4, 5]) ser2 = pd.Series([4, 5, 6, 7, 8])                  

Show Solution

            # Input ser1 = pd.Series([1, 2, 3, 4, 5]) ser2 = pd.Series([4, 5, 6, 7, 8])  # Solution ser1[~ser1.isin(ser2)]                      
            0    1 1    2 2    3 dtype: int64                      

7. How to get the items not common to both series A and series B?

Difficulty Level: L2 Get all items of ser1 and ser2 not common to both. Input

          ser1 = pd.Series([1, 2, 3, 4, 5]) ser2 = pd.Series([4, 5, 6, 7, 8])                  

Show Solution

            # Input ser1 = pd.Series([1, 2, 3, 4, 5]) ser2 = pd.Series([4, 5, 6, 7, 8])  # Solution ser_u = pd.Series(np.union1d(ser1, ser2))  # union ser_i = pd.Series(np.intersect1d(ser1, ser2))  # intersect ser_u[~ser_u.isin(ser_i)]                      
            0    1 1    2 2    3 5    6 6    7 7    8 dtype: int64                      

8. How to get the minimum, 25th percentile, median, 75th, and max of a numeric series?

Difficuty Level: L2 Compute the minimum, 25th percentile, median, 75th, and maximum of ser. Input

          ser = pd.Series(np.random.normal(10, 5, 25))                  

Show Solution

            # Input state = np.random.RandomState(100) ser = pd.Series(state.normal(10, 5, 25))  # Solution np.percentile(ser, q=[0, 25, 50, 75, 100])                      
            array([  1.39267584,   6.49135133,  10.2578186 ,  13.06985067,  25.80920994])                      

9. How to get frequency counts of unique items of a series?

Difficulty Level: L1 Calculte the frequency counts of each unique value ser. Input

          ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))                  

Show Solution

            # Input ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))  # Solution ser.value_counts()                      
            f    8 g    7 b    6 c    4 a    2 e    2 h    1 dtype: int64                      

10. How to keep only top 2 most frequent values as it is and replace everything else as 'Other'?

Difficulty Level: L2 From ser, keep the top 2 most frequent items as it is and replace everything else as 'Other'. Input

          np.random.RandomState(100) ser = pd.Series(np.random.randint(1, 5, [12]))                  

Show Solution

            # Input np.random.RandomState(100) ser = pd.Series(np.random.randint(1, 5, [12]))  # Solution print("Top 2 Freq:", ser.value_counts()) ser[~ser.isin(ser.value_counts().index[:2])] = 'Other' ser                      
            Top 2 Freq: 4    5 3    3 2    2 1    2 dtype: int64  0     Other 1     Other 2         3 3         4 4     Other 5         4 6         4 7         3 8         3 9         4 10        4 11    Other dtype: object                      

11. How to bin a numeric series to 10 groups of equal size?

Difficulty Level: L2 Bin the series ser into 10 equal deciles and replace the values with the bin name. Input

          ser = pd.Series(np.random.random(20))                  

Desired Output

          # First 5 items 0    7th 1    9th 2    7th 3    3rd 4    8th dtype: category Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]                  

Show Solution

            # Input ser = pd.Series(np.random.random(20)) print(ser.head())  # Solution pd.qcut(ser, q=[0, .10, .20, .3, .4, .5, .6, .7, .8, .9, 1],          labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']).head()                      
            0    0.556912 1    0.892955 2    0.566632 3    0.146656 4    0.881579 dtype: float64  0    7th 1    9th 2    7th 3    3rd 4    8th dtype: category Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]                      

12. How to convert a numpy array to a dataframe of given shape? (L1)

Difficulty Level: L1 Reshape the series ser into a dataframe with 7 rows and 5 columns Input

          ser = pd.Series(np.random.randint(1, 10, 35))                  

Show Solution

            # Input ser = pd.Series(np.random.randint(1, 10, 35))  # Solution df = pd.DataFrame(ser.values.reshape(7,5)) print(df)                      
                          0  1  2  3  4 0  1  2  1  2  5 1  1  2  4  5  2 2  1  3  3  2  8 3  8  6  4  9  6 4  2  1  1  8  5 5  3  2  8  5  6 6  1  5  5  4  6                      

13. How to find the positions of numbers that are multiples of 3 from a series?

Difficulty Level: L2 Find the positions of numbers that are multiples of 3 from ser. Input

          ser = pd.Series(np.random.randint(1, 10, 7))                  

Show Solution

            # Input ser = pd.Series(np.random.randint(1, 10, 7)) ser  # Solution print(ser) np.argwhere(ser % 3==0)                      
            0    6 1    8 2    6 3    7 4    6 5    2 6    4 dtype: int64  array([[0],        [2],        [4]])                      

14. How to extract items at given positions from a series

Difficulty Level: L1 From ser, extract the items at positions in list pos. Input

          ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz')) pos = [0, 4, 8, 14, 20]                  

Show Solution

            # Input ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz')) pos = [0, 4, 8, 14, 20]  # Solution ser.take(pos)                      
            0     a 4     e 8     i 14    o 20    u dtype: object                      

15. How to stack two series vertically and horizontally ?

Difficulty Level: L1 Stack ser1 and ser2 vertically and horizontally (to form a dataframe). Input

          ser1 = pd.Series(range(5)) ser2 = pd.Series(list('abcde'))                  

Show Solution

            # Input ser1 = pd.Series(range(5)) ser2 = pd.Series(list('abcde'))  # Output # Vertical ser1.append(ser2)  # Horizontal df = pd.concat([ser1, ser2], axis=1) print(df)                      
                          0  1 0  0  a 1  1  b 2  2  c 3  3  d 4  4  e                      

16. How to get the positions of items of series A in another series B?

Difficulty Level: L2 Get the positions of items of ser2 in ser1 as a list. Input

          ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13]) ser2 = pd.Series([1, 3, 10, 13])                  

Show Solution

            # Input ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13]) ser2 = pd.Series([1, 3, 10, 13])  # Solution 1 [np.where(i == ser1)[0].tolist()[0] for i in ser2]  # Solution 2 [pd.Index(ser1).get_loc(i) for i in ser2]                      
            [5, 4, 0, 8]                      

17. How to compute the mean squared error on a truth and predicted series?

Difficulty Level: L2 Compute the mean squared error of truth and pred series. Input

          truth = pd.Series(range(10)) pred = pd.Series(range(10)) + np.random.random(10)                  

Show Solution

            # Input truth = pd.Series(range(10)) pred = pd.Series(range(10)) + np.random.random(10)  # Solution np.mean((truth-pred)**2)                      
            0.28448128110629545                      

18. How to convert the first character of each element in a series to uppercase?

Difficulty Level: L2 Change the first character of each word to upper case in each word of ser.

          ser = pd.Series(['how', 'to', 'kick', 'ass?'])                  

Show Solution

            # Input ser = pd.Series(['how', 'to', 'kick', 'ass?'])  # Solution 1 ser.map(lambda x: x.title())  # Solution 2 ser.map(lambda x: x[0].upper() + x[1:])  # Solution 3 pd.Series([i.title() for i in ser])                      
            0     How 1      To 2    Kick 3    Ass? dtype: object                      

19. How to calculate the number of characters in each word in a series?

Difficulty Level: L2 Input

          ser = pd.Series(['how', 'to', 'kick', 'ass?'])                  

Show Solution

            # Input ser = pd.Series(['how', 'to', 'kick', 'ass?'])  # Solution ser.map(lambda x: len(x))                      
            0    3 1    2 2    4 3    4 dtype: int64                      

20. How to compute difference of differences between consequtive numbers of a series?

Difficulty Level: L1 Difference of differences between the consequtive numbers of ser. Input

          ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])                  

Desired Output

          [nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0] [nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]                  

Show Solution

            # Input ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])  # Solution print(ser.diff().tolist()) print(ser.diff().diff().tolist())                      
            [nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0] [nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]                      

21. How to convert a series of date-strings to a timeseries?

Difficiulty Level: L2 Input

          ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])                  

Desired Output

          0   2010-01-01 00:00:00 1   2011-02-02 00:00:00 2   2012-03-03 00:00:00 3   2013-04-04 00:00:00 4   2014-05-05 00:00:00 5   2015-06-06 12:20:00 dtype: datetime64[ns]                  

Show Solution

            # Input ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])  # Solution 1 from dateutil.parser import parse ser.map(lambda x: parse(x))  # Solution 2 pd.to_datetime(ser)                      
            0   2010-01-01 00:00:00 1   2011-02-02 00:00:00 2   2012-03-03 00:00:00 3   2013-04-04 00:00:00 4   2014-05-05 00:00:00 5   2015-06-06 12:20:00 dtype: datetime64[ns]                      

22. How to get the day of month, week number, day of year and day of week from a series of date strings?

Difficiulty Level: L2 Get the day of month, week number, day of year and day of week from ser. Input

          ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])                  

Desired output

          Date:  [1, 2, 3, 4, 5, 6] Week number:  [53, 5, 9, 14, 19, 23] Day num of year:  [1, 33, 63, 94, 125, 157] Day of week:  ['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']                  

Show Solution

            # Input ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])  # Solution from dateutil.parser import parse ser_ts = ser.map(lambda x: parse(x))  # day of month print("Date: ", ser_ts.dt.day.tolist())  # week number print("Week number: ", ser_ts.dt.weekofyear.tolist())  # day of year print("Day number of year: ", ser_ts.dt.dayofyear.tolist())  # day of week print("Day of week: ", ser_ts.dt.weekday_name.tolist())                      
            Date:  [1, 2, 3, 4, 5, 6] Week number:  [53, 5, 9, 14, 19, 23] Day num of year:  [1, 33, 63, 94, 125, 157] Day of week:  ['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']                      

23. How to convert year-month string to dates corresponding to the 4th day of the month?

Difficiulty Level: L2 Change ser to dates that start with 4th of the respective months. Input

          ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])                  

Desired Output

          0   2010-01-04 1   2011-02-04 2   2012-03-04 dtype: datetime64[ns]                  

Show Solution

            import pandas as pd # Input ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])  # Solution 1 from dateutil.parser import parse # Parse the date ser_ts = ser.map(lambda x: parse(x))  # Construct date string with date as 4 ser_datestr = ser_ts.dt.year.astype('str') + '-' + ser_ts.dt.month.astype('str') + '-' + '04'  # Format it. [parse(i).strftime('%Y-%m-%d') for i in ser_datestr]  # Solution 2 ser.map(lambda x: parse('04 ' + x))                      
            0   2010-01-04 1   2011-02-04 2   2012-03-04 dtype: datetime64[ns]                      

24. How to filter words that contain atleast 2 vowels from a series?

Difficiulty Level: L3 From ser, extract words that contain atleast 2 vowels. Input

          ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])                  

Desired Output

          0     Apple 1    Orange 4     Money dtype: object                  

Show Solution

            # Input ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])  # Solution from collections import Counter mask = ser.map(lambda x: sum([Counter(x.lower()).get(i, 0) for i in list('aeiou')]) >= 2) ser[mask]                      
            0     Apple 1    Orange 4     Money dtype: object                      

25. How to filter valid emails from a series?

Difficiulty Level: L3 Extract the valid emails from the series emails. The regex pattern for valid emails is provided as reference. Input

          emails = pd.Series(['buying books at amazom.com', '[email protected]', '[email protected]', '[email protected]']) pattern ='[A-Za-z0-9._%+-][email protected][A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'                  

Desired Output

          1    [email protected] 2            [email protected] 3    [email protected] dtype: object                  

Show Solution

26. How to get the mean of a series grouped by another series?

Difficiulty Level: L2 Compute the mean of weights of each fruit. Input

          fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10)) weights = pd.Series(np.linspace(1, 10, 10)) print(weight.tolist()) print(fruit.tolist()) #> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0] #> ['banana', 'carrot', 'apple', 'carrot', 'carrot', 'apple', 'banana', 'carrot', 'apple', 'carrot']                  

Desired output

          # values can change due to randomness apple     6.0 banana    4.0 carrot    5.8 dtype: float64                  

Show Solution

            # Input fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10)) weights = pd.Series(np.linspace(1, 10, 10))  # Solution weights.groupby(fruit).mean()                      
            apple     7.4 banana    2.0 carrot    6.0 dtype: float64                      

27. How to compute the euclidean distance between two series?

Difficiulty Level: L2 Compute the euclidean distance between series (points) p and q, without using a packaged formula. Input

          p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])                  

Desired Output

          18.165                  

Show Solution

            # Input p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])  # Solution  sum((p - q)**2)**.5  # Solution (using func) np.linalg.norm(p-q)                      
            18.165902124584949                      

28. How to find all the local maxima (or peaks) in a numeric series?

Difficiulty Level: L3 Get the positions of peaks (values surrounded by smaller values on both sides) in ser. Input

          ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])                  

Desired output

          array([1, 5, 7])                  

Show Solution

            # Input ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])  # Solution dd = np.diff(np.sign(np.diff(ser))) peak_locs = np.where(dd == -2)[0] + 1 peak_locs                      
            array([1, 5, 7])                      

29. How to replace missing spaces in a string with the least frequent character?

Replace the spaces in my_str with the least frequent character. Difficiulty Level: L2 Input

          my_str = 'dbc deb abed gade'                  

Desired Output

          'dbccdebcabedcgade'  # least frequent is 'c'                  

Show Solution

            # Input my_str = 'dbc deb abed gade'  # Solution ser = pd.Series(list('dbc deb abed gade')) freq = ser.value_counts() print(freq) least_freq = freq.dropna().index[-1] "".join(ser.replace(' ', least_freq))                      
            d    4 b    3 e    3      3 a    2 g    1 c    1 dtype: int64  'dbccdebcabedcgade'                      

30. How to create a TimeSeries starting '2000-01-01' and 10 weekends (saturdays) after that having random numbers as values?

Difficiulty Level: L2 Desired output

          # values can be random 2000-01-01    4 2000-01-08    1 2000-01-15    8 2000-01-22    4 2000-01-29    4 2000-02-05    2 2000-02-12    4 2000-02-19    9 2000-02-26    6 2000-03-04    6                  

Show Solution

            # Solution ser = pd.Series(np.random.randint(1,10,10), pd.date_range('2000-01-01', periods=10, freq='W-SAT')) ser                      
            2000-01-01    6 2000-01-08    7 2000-01-15    4 2000-01-22    6 2000-01-29    8 2000-02-05    6 2000-02-12    5 2000-02-19    8 2000-02-26    1 2000-03-04    7 Freq: W-SAT, dtype: int64                      

31. How to fill an intermittent time series so all missing dates show up with values of previous non-missing date?

Difficiulty Level: L2 ser has missing dates and values. Make all missing dates appear and fill up with value from previous date. Input

          ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08'])) print(ser) #> 2000-01-01     1.0 #> 2000-01-03    10.0 #> 2000-01-06     3.0 #> 2000-01-08     NaN #> dtype: float64                  

Desired Output

          2000-01-01     1.0 2000-01-02     1.0 2000-01-03    10.0 2000-01-04    10.0 2000-01-05    10.0 2000-01-06     3.0 2000-01-07     3.0 2000-01-08     NaN                  

Show Solution

            # Input ser = pd.Series([1,10,3, np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))  # Solution ser.resample('D').ffill()  # fill with previous value  # Alternatives ser.resample('D').bfill()  # fill with next value ser.resample('D').bfill().ffill()  # fill next else prev value                      
            2000-01-01     1.0 2000-01-02    10.0 2000-01-03    10.0 2000-01-04     3.0 2000-01-05     3.0 2000-01-06     3.0 2000-01-07     3.0 2000-01-08     3.0 Freq: D, dtype: float64                      

32. How to compute the autocorrelations of a numeric series?

Difficiulty Level: L3 Compute autocorrelations for the first 10 lags of ser. Find out which lag has the largest correlation. Input

          ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))                  

Desired output

          # values will change due to randomness [0.29999999999999999, -0.11, -0.17000000000000001, 0.46000000000000002, 0.28000000000000003, -0.040000000000000001, -0.37, 0.41999999999999998, 0.47999999999999998, 0.17999999999999999] Lag having highest correlation:  9                  

Show Solution

            # Input ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))  # Solution autocorrelations = [ser.autocorr(i).round(2) for i in range(11)] print(autocorrelations[1:]) print('Lag having highest correlation: ', np.argmax(np.abs(autocorrelations[1:]))+1)                      
            [0.29999999999999999, -0.11, -0.17000000000000001, 0.46000000000000002, 0.28000000000000003, -0.040000000000000001, -0.37, 0.41999999999999998, 0.47999999999999998, 0.17999999999999999] Lag having highest correlation:  9                      

33. How to import only every nth row from a csv file to create a dataframe?

Difficiulty Level: L2 Import every 50th row of BostonHousing dataset as a dataframe. Show Solution

            # Solution 1: Use chunks and for-loop df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', chunksize=50) df2 = pd.DataFrame() for chunk in df:     df2 = df2.append(chunk.iloc[0,:])   # Solution 2: Use chunks and list comprehension df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', chunksize=50) df2 = pd.concat([chunk.iloc[0] for chunk in df], axis=1) df2 = df2.transpose()  # Solution 3: Use csv reader import csv           with open('BostonHousing.csv', 'r') as f:     reader = csv.reader(f)     out = []     for i, row in enumerate(reader):         if i%50 == 0:             out.append(row)  df2 = pd.DataFrame(out[1:], columns=out[0]) print(df2.head())                      
                          crim    zn  indus chas                  nox     rm   age  \ 0              0.21977   0.0   6.91    0  0.44799999999999995  5.602  62.0    1               0.0686   0.0   2.89    0                0.445  7.416  62.5    2   2.7339700000000002   0.0  19.58    0                0.871  5.597  94.9    3               0.0315  95.0   1.47    0  0.40299999999999997  6.975  15.3    4  0.19072999999999998  22.0   5.86    0                0.431  6.718  17.5           dis rad  tax ptratio       b  lstat  medv   0  6.0877   3  233    17.9   396.9   16.2  19.4   1  3.4952   2  276    18.0   396.9   6.19  33.2   2  1.5257   5  403    14.7  351.85  21.45  15.4   3  7.6534   3  402    17.0   396.9   4.56  34.9   4  7.8265   7  330    19.1  393.74   6.56  26.2                      

34. How to change column values when importing csv to a dataframe?

Difficulty Level: L2 Import the boston housing dataset, but while importing change the 'medv' (median house value) column so that values < 25 becomes 'Low' and > 25 becomes 'High'. Show Solution

            # Solution 1: Using converter parameter df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv',                   converters={'medv': lambda x: 'High' if float(x) > 25 else 'Low'})   # Solution 2: Using csv reader import csv with open('BostonHousing.csv', 'r') as f:     reader = csv.reader(f)     out = []     for i, row in enumerate(reader):         if i > 0:             row[13] = 'High' if float(row[13]) > 25 else 'Low'         out.append(row)  df = pd.DataFrame(out[1:], columns=out[0]) print(df.head())                      
                          crim    zn indus chas                  nox  \ 0               0.00632  18.0  2.31    0   0.5379999999999999    1               0.02731   0.0  7.07    0                0.469    2               0.02729   0.0  7.07    0                0.469    3  0.032369999999999996   0.0  2.18    0  0.45799999999999996    4               0.06905   0.0  2.18    0  0.45799999999999996                       rm   age     dis rad  tax ptratio       b lstat  medv   0              6.575  65.2    4.09   1  296    15.3   396.9  4.98   Low   1              6.421  78.9  4.9671   2  242    17.8   396.9  9.14   Low   2              7.185  61.1  4.9671   2  242    17.8  392.83  4.03  High   3  6.997999999999999  45.8  6.0622   3  222    18.7  394.63  2.94  High   4              7.147  54.2  6.0622   3  222    18.7   396.9  5.33  High                      

35. How to create a dataframe with rows as strides from a given series?

Difficiulty Level: L3 Input

          L = pd.Series(range(15))                  

Desired Output

          array([[ 0,  1,  2,  3],        [ 2,  3,  4,  5],        [ 4,  5,  6,  7],        [ 6,  7,  8,  9],        [ 8,  9, 10, 11],        [10, 11, 12, 13]])                  

Show Solution

            L = pd.Series(range(15))  def gen_strides(a, stride_len=5, window_len=5):     n_strides = ((a.size-window_len)//stride_len) + 1     return np.array([a[s:(s+window_len)] for s in np.arange(0, a.size, stride_len)[:n_strides]])  gen_strides(L, stride_len=2, window_len=4)                      
            array([[ 0,  1,  2,  3],        [ 2,  3,  4,  5],        [ 4,  5,  6,  7],        [ 6,  7,  8,  9],        [ 8,  9, 10, 11],        [10, 11, 12, 13]])                      

36. How to import only specified columns from a csv file?

Difficulty Level: L1 Import 'crim' and 'medv' columns of the BostonHousing dataset as a dataframe. Show Solution

            df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', usecols=['crim', 'medv']) print(df.head())                      
                          crim  medv 0  0.00632  24.0 1  0.02731  21.6 2  0.02729  34.7 3  0.03237  33.4 4  0.06905  36.2                      

37. How to get the nrows, ncolumns, datatype, summary stats of each column of a dataframe? Also get the array and list equivalent.

Difficulty Level: L2 Get the number of rows, columns, datatype and summary statistics of each column of the Cars93 dataset. Also get the numpy array and list equivalent of the dataframe. Show Solution

            df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')  #  number of rows and columns print(df.shape)  # datatypes print(df.dtypes)  # how many columns under each dtype print(df.get_dtype_counts()) print(df.dtypes.value_counts())  # summary statistics df_stats = df.describe()  # numpy array  df_arr = df.values  # list df_list = df.values.tolist()                      
            (93, 27) Manufacturer           object Model                  object Type                   object Min.Price             float64 Price                 float64 Max.Price             float64 MPG.city              float64 MPG.highway           float64 AirBags                object DriveTrain             object Cylinders              object EngineSize            float64 Horsepower            float64 RPM                   float64 Rev.per.mile          float64 Man.trans.avail        object Fuel.tank.capacity    float64 Passengers            float64 Length                float64 Wheelbase             float64 Width                 float64 Turn.circle           float64 Rear.seat.room        float64 Luggage.room          float64 Weight                float64 Origin                 object Make                   object dtype: object float64    18 object      9 dtype: int64 float64    18 object      9 dtype: int64                      

38. How to extract the row and column number of a particular cell with given criterion?

Difficulty Level: L1 Input

          df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')                  

Which manufacturer, model and type has the highest Price? What is the row and column number of the cell with the highest Price value? Show Solution

            # Input df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')  # Solution # Get Manufacturer with highest price df.loc[df.Price == np.max(df.Price), ['Manufacturer', 'Model', 'Type']]  # Get Row and Column number row, col = np.where(df.values == np.max(df.Price))  # Get the value df.iat[row[0], col[0]] df.iloc[row[0], col[0]]  # Alternates df.at[row[0], 'Price'] df.get_value(row[0], 'Price')  # The difference between `iat` - `iloc` vs `at` - `loc` is: # `iat` snd `iloc` accepts row and column numbers.  # Whereas `at` and `loc` accepts index and column names.                      
            61.899999999999999                      

39. How to rename a specific columns in a dataframe?

Difficulty Level: L2 Rename the column Type as CarType in df and replace the '.' in column names with '_'. Input

          df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv') print(df.columns) #> Index(['Manufacturer', 'Model', 'Type', 'Min.Price', 'Price', 'Max.Price', #>        'MPG.city', 'MPG.highway', 'AirBags', 'DriveTrain', 'Cylinders', #>        'EngineSize', 'Horsepower', 'RPM', 'Rev.per.mile', 'Man.trans.avail', #>        'Fuel.tank.capacity', 'Passengers', 'Length', 'Wheelbase', 'Width', #>        'Turn.circle', 'Rear.seat.room', 'Luggage.room', 'Weight', 'Origin', #>        'Make'], #>       dtype='object')                  

Desired Solution

          print(df.columns) #> Index(['Manufacturer', 'Model', 'CarType', 'Min_Price', 'Price', 'Max_Price', #>        'MPG_city', 'MPG_highway', 'AirBags', 'DriveTrain', 'Cylinders', #>        'EngineSize', 'Horsepower', 'RPM', 'Rev_per_mile', 'Man_trans_avail', #>        'Fuel_tank_capacity', 'Passengers', 'Length', 'Wheelbase', 'Width', #>        'Turn_circle', 'Rear_seat_room', 'Luggage_room', 'Weight', 'Origin', #>        'Make'], #>       dtype='object')                  

Show Solution

            # Input df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')  # Solution # Step 1: df=df.rename(columns = {'Type':'CarType'}) # or df.columns.values[2] = "CarType"  # Step 2: df.columns = df.columns.map(lambda x: x.replace('.', '_')) print(df.columns)                      
            Index(['Manufacturer', 'Model', 'CarType', 'Min_Price', 'Price', 'Max_Price',        'MPG_city', 'MPG_highway', 'AirBags', 'DriveTrain', 'Cylinders',        'EngineSize', 'Horsepower', 'RPM', 'Rev_per_mile', 'Man_trans_avail',        'Fuel_tank_capacity', 'Passengers', 'Length', 'Wheelbase', 'Width',        'Turn_circle', 'Rear_seat_room', 'Luggage_room', 'Weight', 'Origin',        'Make'],       dtype='object')                      

40. How to check if a dataframe has any missing values?

Difficulty Level: L1 Check if df has any missing values. Input

          df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')                  

Show Solution

            # Input df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')  # Solution df.isnull().values.any()                      

41. How to count the number of missing values in each column?

Difficulty Level: L2 Count the number of missing values in each column of df. Which column has the maximum number of missing values? Input

          df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')                  

Show Solution

            # Input df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')  # Solution n_missings_each_col = df.apply(lambda x: x.isnull().sum()) n_missings_each_col.argmax()                      
            'Luggage.room'                      

42. How to replace missing values of multiple numeric columns with the mean?

Difficulty Level: L2 Replace missing values in Min.Price and Max.Price columns with their respective mean. Input

          df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')                  

Show Solution

            # Input df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')  # Solution df_out = df[['Min.Price', 'Max.Price']] = df[['Min.Price', 'Max.Price']].apply(lambda x: x.fillna(x.mean())) print(df_out.head())                      
                          Min.Price  Max.Price 0  12.900000  18.800000 1  29.200000  38.700000 2  25.900000  32.300000 3  17.118605  44.600000 4  17.118605  21.459091                      

43. How to use apply function on existing columns with global variables as additional arguments?

Difficulty Level: L3 In df, use apply method to replace the missing values in Min.Price with the column's mean and those in Max.Price with the column's median. Input

          df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')                  

Use Hint from StackOverflow Show Solution

            # Input df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')  # Solution d = {'Min.Price': np.nanmean, 'Max.Price': np.nanmedian} df[['Min.Price', 'Max.Price']] = df[['Min.Price', 'Max.Price']].apply(lambda x, d: x.fillna(d[x.name](x)), args=(d, ))                      

44. How to select a specific column from a dataframe as a dataframe instead of a series?

Difficulty Level: L2 Get the first column (a) in df as a dataframe (rather than as a Series). Input

          df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))                  

Show Solution

            # Input df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))  # Solution type(df[['a']]) type(df.loc[:, ['a']]) type(df.iloc[:, [0]])  # Alternately the following returns a Series type(df.a) type(df['a']) type(df.loc[:, 'a']) type(df.iloc[:, 1])                      
            pandas.core.series.Series                      

45. How to change the order of columns of a dataframe?

Difficulty Level: L3 Actually 3 questions.

  1. In df, interchange columns 'a' and 'c'.
  2. Create a generic function to interchange two columns, without hardcoding column names.
  3. Sort the columns in reverse alphabetical order, that is colume 'e' first through column 'a' last.

Input

          df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))                  

Show Solution

            # Input df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))  # Solution Q1 df[list('cbade')]  # Solution Q2 - No hard coding def switch_columns(df, col1=None, col2=None):     colnames = df.columns.tolist()     i1, i2 = colnames.index(col1), colnames.index(col2)     colnames[i2], colnames[i1] = colnames[i1], colnames[i2]     return df[colnames]  df1 = switch_columns(df, 'a', 'c')  # Solution Q3 df[sorted(df.columns)] # or df.sort_index(axis=1, ascending=False, inplace=True)                      

46. How to set the number of rows and columns displayed in the output?

Difficulty Level: L2 Change the pamdas display settings on printing the dataframe df it shows a maximum of 10 rows and 10 columns. Input

          df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')                  

Show Solution

            # Input df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')  # Solution pd.set_option('display.max_columns', 10) pd.set_option('display.max_rows', 10) # df  # Show all available options # pd.describe_option()                      

47. How to format or suppress scientific notations in a pandas dataframe?

Difficulty Level: L2 Suppress scientific notations like 'e-03' in df and print upto 4 numbers after decimal. Input

          df = pd.DataFrame(np.random.random(4)**10, columns=['random']) df #>          random #> 0  3.474280e-03 #> 1  3.951517e-05 #> 2  7.469702e-02 #> 3  5.541282e-28                  

Desired Output

          #>    random #> 0  0.0035 #> 1  0.0000 #> 2  0.0747 #> 3  0.0000                  

Show Solution

            # Input df = pd.DataFrame(np.random.random(4)**10, columns=['random'])  # Solution 1: Rounding df.round(4)  # Solution 2: Use apply to change format df.apply(lambda x: '%.4f' % x, axis=1) # or df.applymap(lambda x: '%.4f' % x)  # Solution 3: Use set_option pd.set_option('display.float_format', lambda x: '%.4f' % x)  # Solution 4: Assign display.float_format pd.options.display.float_format = '{:.4f}'.format print(df)  # Reset/undo float formatting pd.options.display.float_format = None                      
                          random 0  0.0002 1  0.5942 2  0.0000 3  0.0030                      

48. How to format all the values in a dataframe as percentages?

Difficulty Level: L2 Format the values in column 'random' of df as percentages. Input

          df = pd.DataFrame(np.random.random(4), columns=['random']) df #>      random #> 0    .689723 #> 1    .957224 #> 2    .159157 #> 3    .21082                  

Desired Output

          #>      random #> 0    68.97% #> 1    95.72% #> 2    15.91% #> 3    2.10%                  

Show Solution

            # Input df = pd.DataFrame(np.random.random(4), columns=['random'])  # Solution out = df.style.format({     'random': '{0:.2%}'.format, })  out                      
random
0 21.66%
1 44.90%
2 85.69%
3 92.12%

49. How to filter every nth row in a dataframe?

Difficulty Level: L1 From df, filter the 'Manufacturer', 'Model' and 'Type' for every 20th row starting from 1st (row 0). Input

          df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')                  

Show Solution

            # Input df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')  # Solution print(df.iloc[::20, :][['Manufacturer', 'Model', 'Type']])                      
                          Manufacturer    Model     Type 0         Acura  Integra    Small 20     Chrysler  LeBaron  Compact 40        Honda  Prelude   Sporty 60      Mercury   Cougar  Midsize 80       Subaru   Loyale    Small                      

50. How to create a primary key index by combining relevant columns?

Difficulty Level: L2 In df, Replace NaNs with 'missing' in columns 'Manufacturer', 'Model' and 'Type' and create a index as a combination of these three columns and check if the index is a primary key. Input

          df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', usecols=[0,1,2,3,5])                  

Desired Output

                      Manufacturer    Model     Type  Min.Price  Max.Price Acura_Integra_Small           Acura  Integra    Small       12.9       18.8 missing_Legend_Midsize      missing   Legend  Midsize       29.2       38.7 Audi_90_Compact                Audi       90  Compact       25.9       32.3 Audi_100_Midsize               Audi      100  Midsize        NaN       44.6 BMW_535i_Midsize                BMW     535i  Midsize        NaN        NaN                  

Show Solution

            # Input df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', usecols=[0,1,2,3,5])  # Solution df[['Manufacturer', 'Model', 'Type']] = df[['Manufacturer', 'Model', 'Type']].fillna('missing') df.index = df.Manufacturer + '_' + df.Model + '_' + df.Type print(df.index.is_unique)                      
            True                      

51. How to get the row number of the nth largest value in a column?

Difficulty Level: L2 Find the row position of the 5th largest value of column 'a' in df. Input

          df = pd.DataFrame(np.random.randint(1, 30, 30).reshape(10,-1), columns=list('abc'))                  

Show Solution

            # Input df = pd.DataFrame(np.random.randint(1, 30, 30).reshape(10,-1), columns=list('abc'))  # Solution n = 5 df['a'].argsort()[::-1][n]                      
                          a   b   c 0  27   7  25 1   8   4  20 2   1   7  17 3  24   9  17 4  21  15   9 5  21  16  20 6  19  27  25 7  12   8  20 8  11  16  28 9  24  13   4  4                      

52. How to find the position of the nth largest value greater than a given value?

Difficulty Level: L2 In ser, find the position of the 2nd largest value greater than the mean. Input

          ser = pd.Series(np.random.randint(1, 100, 15))                  

Show Solution

            # Input ser = pd.Series(np.random.randint(1, 100, 15))  # Solution print('ser: ', ser.tolist(), 'mean: ', round(ser.mean())) np.argwhere(ser > ser.mean())[1]                      
            ser:  [7, 77, 16, 86, 60, 38, 34, 36, 83, 27, 16, 52, 50, 52, 54] mean:  46  array([3])                      

53. How to get the last n rows of a dataframe with row sum > 100?

Difficulty Level: L2 Get the last two rows of df whose row sum is greater than 100.

          df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))                  

Show Solution

            # Input df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))  # Solution # print row sums rowsums = df.apply(np.sum, axis=1)  # last two rows with row sum greater than 100 last_two_rows = df.iloc[np.where(rowsums > 100)[0][-2:], :]                      

54. How to find and cap outliers from a series or dataframe column?

Difficulty Level: L2 Replace all values of ser in the lower 5%ile and greater than 95%ile with respective 5th and 95th %ile value. Input

          ser = pd.Series(np.logspace(-2, 2, 30))                  

Show Solution

            # Input ser = pd.Series(np.logspace(-2, 2, 30))  # Solution def cap_outliers(ser, low_perc, high_perc):     low, high = ser.quantile([low_perc, high_perc])     print(low_perc, '%ile: ', low, '|', high_perc, '%ile: ', high)     ser[ser < low] = low     ser[ser > high] = high     return(ser)  capped_ser = cap_outliers(ser, .05, .95)                      
            0.05 %ile:  0.016049294077 | 0.95 %ile:  63.8766722202                      

55. How to reshape a dataframe to the largest possible square after removing the negative values?

Difficulty Level: L3 Reshape df to the largest possible square with negative values removed. Drop the smallest values if need be. The order of the positive numbers in the result should remain the same as the original. Input

          df = pd.DataFrame(np.random.randint(-20, 50, 100).reshape(10,-1))                  

Show Solution

            # Input df = pd.DataFrame(np.random.randint(-20, 50, 100).reshape(10,-1)) print(df)  # Solution # Step 1: remove negative values from arr arr = df[df > 0].values.flatten() arr_qualified = arr[~np.isnan(arr)]  # Step 2: find side-length of largest possible square n = int(np.floor(arr_qualified.shape[0]**.5))  # Step 3: Take top n^2 items without changing positions top_indexes = np.argsort(arr_qualified)[::-1] output = np.take(arr_qualified, sorted(top_indexes[:n**2])).reshape(n, -1) print(output)                      
                          0   1   2   3   4   5   6   7   8   9 0  25 -13  17  16   0   6  22  44  10 -19 1  47   4  -1  29 -13  12  41 -13  49  42 2  20 -20   9  16 -17  -1  37  39  41  37 3  27  44  -5   5   3 -12   0 -13  23  45 4   8  27  -8  -3  48 -16  -5  40  16  10 5  12  12  41 -12   3 -17  -3  27 -15  -1 6  -9  -3  41 -13   1   0  28  33  -2  18 7  18 -14  35   5   4  14   4  44  14  34 8   1  24  26  28 -10  17 -14  14  38  17 9  13  12   5   9 -16  -7  12 -18   1  24 [[ 25.  17.  16.   6.  22.  44.  10.  47.]  [  4.  29.  12.  41.  49.  42.  20.   9.]  [ 16.  37.  39.  41.  37.  27.  44.   5.]  [  3.  23.  45.   8.  27.  48.  40.  16.]  [ 10.  12.  12.  41.   3.  27.  41.  28.]  [ 33.  18.  18.  35.   5.   4.  14.   4.]  [ 44.  14.  34.  24.  26.  28.  17.  14.]  [ 38.  17.  13.  12.   5.   9.  12.  24.]]                      

56. How to swap two rows of a dataframe?

Difficulty Level: L2 Swap rows 1 and 2 in df. Input

          df = pd.DataFrame(np.arange(25).reshape(5, -1))                  

Show Solution

            # Input df = pd.DataFrame(np.arange(25).reshape(5, -1))  # Solution def swap_rows(df, i1, i2):     a, b = df.iloc[i1, :].copy(), df.iloc[i2, :].copy()     df.iloc[i1, :], df.iloc[i2, :] = b, a     return df  print(swap_rows(df, 1, 2))                      
                          0   1   2   3   4 0   0   1   2   3   4 1  10  11  12  13  14 2   5   6   7   8   9 3  15  16  17  18  19 4  20  21  22  23  24                      

57. How to reverse the rows of a dataframe?

Difficulty Level: L2 Reverse all the rows of dataframe df. Input

          df = pd.DataFrame(np.arange(25).reshape(5, -1))                  

Show Solution

            # Input df = pd.DataFrame(np.arange(25).reshape(5, -1))  # Solution 1 df.iloc[::-1, :]  # Solution 2 print(df.loc[df.index[::-1], :])                      
                          0   1   2   3   4 4  20  21  22  23  24 3  15  16  17  18  19 2  10  11  12  13  14 1   5   6   7   8   9 0   0   1   2   3   4                      

58. How to create one-hot encodings of a categorical variable (dummy variables)?

Difficulty Level: L2 Get one-hot encodings for column 'a' in the dataframe df and append it as columns. Input

          df = pd.DataFrame(np.arange(25).reshape(5,-1), columns=list('abcde'))     a   b   c   d   e 0   0   1   2   3   4 1   5   6   7   8   9 2  10  11  12  13  14 3  15  16  17  18  19 4  20  21  22  23  24                  

Output

                      0  5  10  15  20   b   c   d   e 0  1  0   0   0   0   1   2   3   4 1  0  1   0   0   0   6   7   8   9 2  0  0   1   0   0  11  12  13  14 3  0  0   0   1   0  16  17  18  19 4  0  0   0   0   1  21  22  23  24                  

Show Solution

            # Input df = pd.DataFrame(np.arange(25).reshape(5,-1), columns=list('abcde'))  # Solution df_onehot = pd.concat([pd.get_dummies(df['a']), df[list('bcde')]], axis=1) print(df_onehot)                      
                          a   b   c   d   e 0   0   1   2   3   4 1   5   6   7   8   9 2  10  11  12  13  14 3  15  16  17  18  19 4  20  21  22  23  24    0  5  10  15  20   b   c   d   e 0  1  0   0   0   0   1   2   3   4 1  0  1   0   0   0   6   7   8   9 2  0  0   1   0   0  11  12  13  14 3  0  0   0   1   0  16  17  18  19 4  0  0   0   0   1  21  22  23  24                      

59. Which column contains the highest number of row-wise maximum values?

Difficulty Level: L2 Obtain the column name with the highest number of row-wise maximum's in df.

          df = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1))                  

Show Solution

            # Input df = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1))  # Solution print('Column with highest row maxes: ', df.apply(np.argmax, axis=1).value_counts().index[0])                      
            Column with highest row maxes:  2                      

60. How to create a new column that contains the row number of nearest column by euclidean distance?

Create a new column such that, each row contains the row number of nearest row-record by euclidean distance. Difficulty Level: L3 Input

          df = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1), columns=list('pqrs'), index=list('abcdefghij')) df #     p   q   r   s # a  57  77  13  62 # b  68   5  92  24 # c  74  40  18  37 # d  80  17  39  60 # e  93  48  85  33 # f  69  55   8  11 # g  39  23  88  53 # h  63  28  25  61 # i  18   4  73   7 # j  79  12  45  34                  

Desired Output

          df #    p   q   r   s nearest_row   dist # a  57  77  13  62           i  116.0 # b  68   5  92  24           a  114.0 # c  74  40  18  37           i   91.0 # d  80  17  39  60           i   89.0 # e  93  48  85  33           i   92.0 # f  69  55   8  11           g  100.0 # g  39  23  88  53           f  100.0 # h  63  28  25  61           i   88.0 # i  18   4  73   7           a  116.0 # j  79  12  45  34           a   81.0                  

Show Solution

            df = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1), columns=list('pqrs'), index=list('abcdefghij'))  # Solution import numpy as np  # init outputs nearest_rows = [] nearest_distance = []  # iterate rows. for i, row in df.iterrows():     curr = row     rest = df.drop(i)     e_dists = {}  # init dict to store euclidean dists for current row.     # iterate rest of rows for current row     for j, contestant in rest.iterrows():         # compute euclidean dist and update e_dists         e_dists.update({j: round(np.linalg.norm(curr.values - contestant.values))})     # update nearest row to current row and the distance value     nearest_rows.append(max(e_dists, key=e_dists.get))     nearest_distance.append(max(e_dists.values()))  df['nearest_row'] = nearest_rows df['dist'] = nearest_distance                      

61. How to know the maximum possible correlation value of each column against other columns?

Difficulty Level: L2 Compute maximum possible absolute correlation value of each column against other columns in df. Input

          df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1), columns=list('pqrstuvwxy'), index=list('abcdefgh'))                  

Show Solution

            # Input df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1), columns=list('pqrstuvwxy'), index=list('abcdefgh')) df  # Solution abs_corrmat = np.abs(df.corr()) max_corr = abs_corrmat.apply(lambda x: sorted(x)[-2]) print('Maximum Correlation possible for each column: ', np.round(max_corr.tolist(), 2))                      
            Maximum Correlation possible for each column:  [ 0.91  0.57  0.55  0.71  0.53  0.26  0.91  0.71  0.69  0.71]                      

62. How to create a column containing the minimum by maximum of each row?

Difficulty Level: L2 Compute the minimum-by-maximum for every row of df.

          df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))                  

Show Solution

            # Input df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))  # Solution 1 min_by_max = df.apply(lambda x: np.min(x)/np.max(x), axis=1)  # Solution 2 min_by_max = np.min(df, axis=1)/np.max(df, axis=1)                      

63. How to create a column that contains the penultimate value in each row?

Difficulty Level: L2 Create a new column 'penultimate' which has the second largest value of each row of df. Input

          df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))                  

Show Solution

            # Input df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))  # Solution out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1) df['penultimate'] = out print(df)                      
                          0   1   2   3   4   5   6   7   8   9  penultimate 0  52  69  62   7  20  69  38  10  57  17           62 1  52  94  49  63   1  90  14  76  20  84           90 2  78  37  58   7  27  41  27  26  48  51           58 3   6  39  99  36  62  90  47  25  60  84           90 4  37  36  91  93  76  69  86  95  69   6           93 5   5  54  73  61  22  29  99  27  46  24           73 6  71  65  45   9  63  46   4  93  36  18           71 7  85   7  76  46  65  97  64  52  28  80           85                      

64. How to normalize all columns in a dataframe?

Difficulty Level: L2

  1. Normalize all columns of df by subtracting the column mean and divide by standard deviation.
  2. Range all columns of df such that the minimum value in each column is 0 and max is 1.

Don't use external packages like sklearn. Input

          df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))                  

Show Solution

            # Input df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))  # Solution Q1 out1 = df.apply(lambda x: ((x - x.mean())/x.std()).round(2)) print('Solution Q1\n',out1)  # Solution Q2 out2 = df.apply(lambda x: ((x.max() - x)/(x.max() - x.min())).round(2)) print('Solution Q2\n', out2)                      
            Solution Q1        0     1     2     3     4     5     6     7     8     9 0  1.09  0.64 -0.33 -0.96 -1.30  0.06  0.38  1.18 -1.60  1.66 1 -0.93 -2.36  0.87  1.47 -1.15  1.27  0.07 -0.87 -0.18  0.23 2  1.53  0.48 -0.90  0.18 -0.33  0.81 -1.29  0.34  0.06 -0.55 3  0.59 -0.24 -1.06  0.61  1.18 -1.23 -0.53 -0.45  0.34 -1.25 4  0.18  0.33  1.07  1.17  0.50 -0.26 -0.25 -1.45  1.11  1.11 5 -1.16  0.64 -0.93 -0.59 -0.15  0.63  1.02  1.13  1.20 -0.19 6 -0.58  0.07 -0.20 -0.87 -0.22 -1.62 -1.04  0.81 -1.23 -1.04 7 -0.73  0.45  1.47 -1.02  1.47  0.34  1.65 -0.71  0.31  0.02 Solution Q2        0     1     2     3     4     5     6     7     8     9 0  0.16  0.00  0.71  0.98  1.00  0.42  0.43  0.00  1.00  0.00 1  0.91  1.00  0.24  0.00  0.95  0.00  0.54  0.78  0.49  0.49 2  0.00  0.05  0.93  0.52  0.65  0.16  1.00  0.32  0.41  0.76 3  0.35  0.29  1.00  0.35  0.10  0.86  0.74  0.62  0.31  1.00 4  0.50  0.10  0.16  0.12  0.35  0.53  0.65  1.00  0.03  0.19 5  1.00  0.00  0.95  0.83  0.58  0.22  0.22  0.02  0.00  0.64 6  0.78  0.19  0.66  0.94  0.61  1.00  0.91  0.14  0.87  0.93 7  0.84  0.06  0.00  1.00  0.00  0.32  0.00  0.72  0.32  0.56                      

65. How to compute the correlation of each row with the suceeding row?

Difficulty Level: L2 Compute the correlation of each row of df with its succeeding row. Input

          df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))                  

Show Solution

            # Input df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))  # Solution [df.iloc[i].corr(df.iloc[i+1]).round(2) for i in range(df.shape[0])[:-1]]                      
                          0   1   2   3   4   5   6   7   8   9 0  93  49  26   2  96  56  11  73  90  65 1  54  17  47  52  65   9  21  87  94   4 2  51  11  44  77  37  57  17  25  95  26 3  84   8  61  43  63  63  59  65  69  29 4   8  27  53  95  10  35  16  61  39  83 5  30  70  91  26  12  44  37  71  21  48 6  66  44  47  44  29  99  86  78  31   1 7  17  40  28  12  89  95  79  54  81  47  [0.40999999999999998,  0.47999999999999998,  0.42999999999999999,  -0.37,  0.23000000000000001,  0.14000000000000001,  0.22]                      

66. How to replace both the diagonals of dataframe with 0?

Difficulty Level: L2 Replace both values in both diagonals of df with 0. Input

          df = pd.DataFrame(np.random.randint(1,100, 100).reshape(10, -1)) df #     0   1   2   3   4   5   6   7   8   9 # 0  11  46  26  44  11  62  18  70  68  26 # 1  87  71  52  50  81  43  83  39   3  59 # 2  47  76  93  77  73   2   2  16  14  26 # 3  64  18  74  22  16  37  60   8  66  39 # 4  10  18  39  98  25   8  32   6   3  29 # 5  29  91  27  86  23  84  28  31  97  10 # 6  37  71  70  65   4  72  82  89  12  97 # 7  65  22  97  75  17  10  43  78  12  77 # 8  47  57  96  55  17  83  61  85  26  86 # 9  76  80  28  45  77  12  67  80   7  63                  

Desired output

          #     0   1   2   3   4   5   6   7   8   9 # 0   0  46  26  44  11  62  18  70  68   0 # 1  87   0  52  50  81  43  83  39   0  59 # 2  47  76   0  77  73   2   2   0  14  26 # 3  64  18  74   0  16  37   0   8  66  39 # 4  10  18  39  98   0   0  32   6   3  29 # 5  29  91  27  86   0   0  28  31  97  10 # 6  37  71  70   0   4  72   0  89  12  97 # 7  65  22   0  75  17  10  43   0  12  77 # 8  47   0  96  55  17  83  61  85   0  86 # 9   0  80  28  45  77  12  67  80   7   0                  

Show Solution

            # Input df = pd.DataFrame(np.random.randint(1,100, 100).reshape(10, -1))  # Solution for i in range(df.shape[0]):     df.iat[i, i] = 0     df.iat[df.shape[0]-i-1, i] = 0                      

67. How to get the particular group of a groupby dataframe by key?

Difficulty Level: L2 This is a question related to understanding of grouped dataframe. From df_grouped, get the group belonging to 'apple' as a dataframe. Input

          df = pd.DataFrame({'col1': ['apple', 'banana', 'orange'] * 3,                    'col2': np.random.rand(9),                    'col3': np.random.randint(0, 15, 9)})  df_grouped = df.groupby(['col1'])                  
          # Input df = pd.DataFrame({'col1': ['apple', 'banana', 'orange'] * 3,                    'col2': np.random.rand(9),                    'col3': np.random.randint(0, 15, 9)})  df_grouped = df.groupby(['col1'])  # Solution 1 df_grouped.get_group('apple')  # Solution 2 for i, dff in df_grouped:     if i == 'apple':         print(dff)                  
                      col1      col2  col3 0  apple  0.673434     7 3  apple  0.182348    14 6  apple  0.050457     3                  

[/expand]

68. How to get the n'th largest value of a column when grouped by another column?

Difficulty Level: L2 In df, find the second largest value of 'taste' for 'banana' Input

          df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,                    'rating': np.random.rand(9),                    'price': np.random.randint(0, 15, 9)})                  
                            

Show Solution

            # Input df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,                    'taste': np.random.rand(9),                    'price': np.random.randint(0, 15, 9)})  print(df)  # Solution df_grpd = df['taste'].groupby(df.fruit) df_grpd.get_group('banana').sort_values().iloc[-2]                      
                          fruit  price     taste 0   apple      7  0.190229 1  banana      2  0.438063 2  orange      1  0.860182 3   apple      6  0.042149 4  banana      2  0.896021 5  orange      5  0.255107 6   apple      6  0.874533 7  banana      4  0.696274 8  orange      9  0.140713  0.69627423645996078                      

69. How to compute grouped mean on pandas dataframe and keep the grouped column as another column (not index)?

Difficulty Level: L1 In df, Compute the mean price of every fruit, while keeping the fruit as another column instead of an index. Input

          df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,                    'rating': np.random.rand(9),                    'price': np.random.randint(0, 15, 9)})                  
                            

Show Solution

            # Input df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,                    'rating': np.random.rand(9),                    'price': np.random.randint(0, 15, 9)})  # Solution out = df.groupby('fruit', as_index=False)['price'].mean() print(out)                      
                          fruit      price 0   apple  11.000000 1  banana   6.333333 2  orange   6.333333                      

70. How to join two dataframes by 2 columns so they have only the common rows?

Difficulty Level: L2 Join dataframes df1 and df2 by 'fruit-pazham' and 'weight-kilo'. Input

          df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,                     'weight': ['high', 'medium', 'low'] * 3,                     'price': np.random.randint(0, 15, 9)})  df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,                     'kilo': ['high', 'low'] * 3,                     'price': np.random.randint(0, 15, 6)})                  

Show Solution

            # Input df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,                     'weight': ['high', 'medium', 'low'] * 3,                     'price': np.random.randint(0, 15, 9)})  df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,                     'kilo': ['high', 'low'] * 3,                     'price': np.random.randint(0, 15, 6)})  # Solution pd.merge(df1, df2, how='inner', left_on=['fruit', 'weight'], right_on=['pazham', 'pounds'], suffixes=['_left', '_right'])                      
fruit price_left weight pazham pounds price_right
0 apple 5 high apple high 11
1 apple 10 high apple high 11
2 apple 8 high apple high 11
3 orange 6 low orange low 6
4 orange 7 low orange low 6
5 orange 0 low orange low 6

71. How to remove rows from a dataframe that are present in another dataframe?

Difficulty Level: L3 From df1, remove the rows that are present in df2. All three columns must be the same. Input

            df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,                     'weight': ['high', 'medium', 'low'] * 3,                     'price': np.random.randint(0, 15, 9)})  df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,                     'kilo': ['high', 'low'] * 3,                     'price': np.random.randint(0, 15, 6)})                      

Show Solution

            # Input df1 = pd.DataFrame({'fruit': ['apple', 'orange', 'banana'] * 3,                     'weight': ['high', 'medium', 'low'] * 3,                     'price': np.arange(9)})  df2 = pd.DataFrame({'fruit': ['apple', 'orange', 'pine'] * 2,                     'weight': ['high', 'medium'] * 3,                     'price': np.arange(6)})   # Solution print(df1[~df1.isin(df2).all(1)])                      
                          fruit  price  weight 2  banana      2     low 3   apple      3    high 4  orange      4  medium 5  banana      5     low 6   apple      6    high 7  orange      7  medium 8  banana      8     low                      

72. How to get the positions where values of two columns match?

Difficulty Level: L2 Show Solution

            # Input df = pd.DataFrame({'fruit1': np.random.choice(['apple', 'orange', 'banana'], 10),                     'fruit2': np.random.choice(['apple', 'orange', 'banana'], 10)})  # Solution np.where(df.fruit1 == df.fruit2)                      
            (array([1, 5, 9]),)                      

73. How to create lags and leads of a column in a dataframe?

Difficulty Level: L2 Create two new columns in df, one of which is a lag1 (shift column a down by 1 row) of column 'a' and the other is a lead1 (shift column b up by 1 row). Input

          df = pd.DataFrame(np.random.randint(1, 100, 20).reshape(-1, 4), columns = list('abcd'))      a   b   c   d 0  66  34  76  47 1  20  86  10  81 2  75  73  51  28 3   1   1   9  83 4  30  47  67   4                  

Desired Output

                      a   b   c   d  a_lag1  b_lead1 0  66  34  76  47     NaN     86.0 1  20  86  10  81    66.0     73.0 2  75  73  51  28    20.0      1.0 3   1   1   9  83    75.0     47.0 4  30  47  67   4     1.0      NaN                  

Show Solution

            # Input df = pd.DataFrame(np.random.randint(1, 100, 20).reshape(-1, 4), columns = list('abcd'))  # Solution df['a_lag1'] = df['a'].shift(1) df['b_lead1'] = df['b'].shift(-1) print(df)                      
                          a   b   c   d  a_lag1  b_lead1 0  29  90  43  24     NaN     36.0 1  94  36  67  66    29.0     76.0 2  81  76  44  49    94.0     97.0 3  55  97  10  74    81.0     43.0 4  32  43  62  62    55.0      NaN                      

74. How to get the frequency of unique values in the entire dataframe?

Difficulty Level: L2 Get the frequency of unique values in the entire dataframe df. Input

          df = pd.DataFrame(np.random.randint(1, 10, 20).reshape(-1, 4), columns = list('abcd'))                  

Show Solution

            # Input df = pd.DataFrame(np.random.randint(1, 10, 20).reshape(-1, 4), columns = list('abcd'))  # Solution pd.value_counts(df.values.ravel())                      
            5    4 4    4 9    3 8    2 7    2 3    2 6    1 2    1 1    1 dtype: int64                      

75. How to split a text column into two separate columns?

Difficulty Level: L2 Split the string column in df to form a dataframe with 3 columns as shown. Input

          df = pd.DataFrame(["STD, City    State", "33, Kolkata    West Bengal", "44, Chennai    Tamil Nadu", "40, Hyderabad    Telengana", "80, Bangalore    Karnataka"], columns=['row'])  print(df) #>                         row #> 0          STD, City\tState #> 1  33, Kolkata\tWest Bengal #> 2   44, Chennai\tTamil Nadu #> 3  40, Hyderabad\tTelengana #> 4  80, Bangalore\tKarnataka                  

Desired Output

          0 STD        City        State 1  33     Kolkata  West Bengal 2  44     Chennai   Tamil Nadu 3  40   Hyderabad    Telengana 4  80   Bangalore    Karnataka                  

Show Solution

            # Input df = pd.DataFrame(["STD, City    State", "33, Kolkata    West Bengal", "44, Chennai    Tamil Nadu", "40, Hyderabad    Telengana", "80, Bangalore    Karnataka"], columns=['row'])  # Solution df_out = df.row.str.split(',|\t', expand=True)  # Make first row as header new_header = df_out.iloc[0] df_out = df_out[1:] df_out.columns = new_header print(df_out)                      
            0 STD        City        State 1  33     Kolkata  West Bengal 2  44     Chennai   Tamil Nadu 3  40   Hyderabad    Telengana 4  80   Bangalore    Karnataka                      

To be continued . .

Find Oh for the Fourth Row Solution

Source: https://www.machinelearningplus.com/python/101-pandas-exercises-python/