npm is a package manager that is essential to the Node.js community. As of September 22nd, it hosted over 99,000 packages, a number that grew by 10,000 in the last two months alone.
As the Node.js package ecosystem grows, it is essential to understand how the system behaves: how many and which packages are crucial to the community, how the ecosystem evolves over time, and whether we can predict the success of a package from its metrics alone.
82% of packages are at major version 0.*, 14% are at 1.*, and only 3% are greater than 1.*.
all_packages['major_version_category'].value_counts(normalize=True)
version_major_zero 0.822502
version_major_one 0.143654
version_major_gt_one 0.033843
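How the `major_version_category` column is built is not shown in this section. A minimal sketch of one way to derive it from a package's latest version string, assuming versions follow the `major.minor.patch` semver layout. The helper name and the `version_unknown` fallback are assumptions for illustration, not the original notebook's code:

```python
def major_version_category(version):
    """Classify a semver-style version string by its major component.

    Hypothetical helper: the category names match the output above, but the
    parsing rules (and the 'version_unknown' fallback) are assumptions.
    """
    major_text = version.strip().lstrip("v").split(".")[0]
    if not major_text.isdigit():
        return "version_unknown"  # assumed fallback, not in the original data
    major = int(major_text)
    if major == 0:
        return "version_major_zero"
    if major == 1:
        return "version_major_one"
    return "version_major_gt_one"


for v in ["0.5.197", "1.0.0", "2.4.1"]:
    print(v, major_version_category(v))
```

Applied to a column of version strings, a function like this would yield the three categories counted above.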
The median number of releases per package is 3.
all_packages.version_count.median()
Packages whose major version is greater than 1 have a higher median of 7 releases.
all_packages.groupby(['major_version_category'])['version_count'].median()
version_major_gt_one 7
version_major_one 3
version_major_zero 3
The higher the major version, the more releases a package has on average.
sns.barplot(x="major_version_category", y="version_count", data=all_packages)
The package with the most versions is apostrophe, with 433 releases! Its current version number is 0.5.197.
all_packages[all_packages['version_count'] == all_packages.version_count.max()]
21% of packages were last updated between one and two years ago.
all_packages['updated_category'].value_counts(normalize=True)
within_last_year 0.706571
havent_been_updated_in_1_year 0.208503
havent_been_updated_in_2_year 0.068142
havent_been_updated_in_3_year 0.016784
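The `updated_category` buckets can be read as whole years since the last modification. A sketch of that bucketing, assuming the boundaries fall at multiples of 365 days (which matches the "between one and two years ago" reading of the `havent_been_updated_in_1_year` bucket). The function name is hypothetical and the original bucketing code is not shown:

```python
def updated_category(days_since_modified):
    """Bucket a package by whole years since its last update.

    Assumption: boundaries sit at multiples of 365 days, and anything
    three or more years old lands in the last bucket.
    """
    years = int(days_since_modified // 365)
    if years < 1:
        return "within_last_year"
    return "havent_been_updated_in_%d_year" % min(years, 3)


for days in [30, 400, 800, 2000]:
    print(days, updated_category(days))
```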
Packages with a higher major version tend to have been updated more recently. Median days since last modification, by major version category:
all_packages.groupby(['major_version_category'])['deltaSinceModifiedDays'].median()
version_major_gt_one 112
version_major_one 118
version_major_zero 210
A larger share of packages whose major version is greater than 1.* has been updated within the past year (87%) than of those whose version is 0.* (68%). That said, in absolute terms 53,180 packages at 0.* have been updated in the last year, versus only 2,771 at versions greater than 1.*.
# Share of each update bucket within every major version category
cats = all_packages['major_version_category'].unique()
for c in cats:
    print("=> " + c)
    sub = all_packages[all_packages['major_version_category'] == c]
    print(sub.groupby(['updated_category'])['version_count'].count() / float(len(sub)))
=> version_major_zero
updated_category
havent_been_updated_in_1_year 0.222227
havent_been_updated_in_2_year 0.074814
havent_been_updated_in_3_year 0.018295
within_last_year 0.684665
Name: version_count, dtype: float64
=> version_major_one
updated_category
havent_been_updated_in_1_year 0.155831
havent_been_updated_in_2_year 0.039437
havent_been_updated_in_3_year 0.010541
within_last_year 0.794191
Name: version_count, dtype: float64
=> version_major_gt_one
updated_category
havent_been_updated_in_1_year 0.098561
havent_been_updated_in_2_year 0.027534
havent_been_updated_in_3_year 0.006884
within_last_year 0.867021
Name: version_count, dtype: float64
28,322 packages (~30%) have been updated in the last 3 months alone.
len(all_packages[all_packages['deltaSinceModifiedDays'] < 365/4.0]) / float(len(all_packages))
0.29990999099909993
In the past year alone, 54,051 packages were added to npm. That is 57% of all packages on npm.
len(all_packages[all_packages.age < 365]) / float(len(all_packages))
0.5723619420765605
Regardless of age, most packages aren't past major version 0.*. Between 78% and 84% of packages in every age bucket are 0.*.
# Share of each major version category within every age bucket
cats = all_packages['age_category'].unique()
for c in cats:
    print("=> " + c)
    sub = all_packages[all_packages['age_category'] == c]
    print(sub.groupby(['major_version_category'])['package'].count() / float(len(sub)))
=> age_0.5_year
major_version_category
version_major_gt_one 0.026615
version_major_one 0.149802
version_major_zero 0.823583
Name: package, dtype: float64
=> age_2_year
major_version_category
version_major_gt_one 0.040523
version_major_one 0.129075
version_major_zero 0.830364
Name: package, dtype: float64
=> age_3_year
major_version_category
version_major_gt_one 0.041274
version_major_one 0.123679
version_major_zero 0.835047
Name: package, dtype: float64
=> age_1_year
major_version_category
version_major_gt_one 0.032167
version_major_one 0.136044
version_major_zero 0.831789
Name: package, dtype: float64
=> age_0.25_year
major_version_category
version_major_gt_one 0.025377
version_major_one 0.192218
version_major_zero 0.782405
The semver specification gets applied fairly liberally to package versioning. Some authors are very careful about bumping even minor versions, while others speed along past the infamous 1.*, never to look back. A regression of major version on package age shows no correlation between the two (R-squared 0.000, p = 0.763).
from statsmodels.formula.api import ols

# Regress major version on package age
mod = ols(formula='version_major ~ age', data=all_packages)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:          version_major   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                   0.09118
Date:                Wed, 24 Sep 2014   Prob (F-statistic):              0.763
Time:                        11:48:31   Log-Likelihood:                -98690.
No. Observations:               94435   AIC:                         1.974e+05
Df Residuals:                   94433   BIC:                         1.974e+05
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.2340      0.004     65.640      0.000         0.227     0.241
age        -2.175e-06    7.2e-06     -0.302      0.763     -1.63e-05  1.19e-05
==============================================================================
Omnibus:                   190799.917   Durbin-Watson:                   1.736
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       2537154961.582
Skew:                          16.109   Prob(JB):                         0.00
Kurtosis:                     805.348   Cond. No.                         788.
==============================================================================
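The `R-squared` value and the `age` coefficient in the summary come from an ordinary least squares fit of one variable on another. A self-contained sketch of that computation on made-up numbers may help in reading the table. The data and the `simple_ols` name are illustrative and not taken from the npm dataset:

```python
def simple_ols(xs, ys):
    """Fit y = intercept + slope * x by least squares.

    Returns (intercept, slope, r_squared). Assumes xs and ys each contain
    at least two distinct values, otherwise a division by zero occurs.
    """
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_resid / ss_total
    return intercept, slope, r_squared


# Toy data: major version does not move with age, so slope and R-squared are 0.
ages = [100, 200, 300, 400]
majors = [0, 1, 1, 0]
print(simple_ols(ages, majors))
```

An R-squared near zero, as in the summary, means the fitted line explains almost none of the variation in the dependent variable.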
Deep dependents is a value representing the number of packages that depend on another package, either directly or indirectly (through other packages).
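A minimal sketch of how such a count could be computed, assuming the dependency data is available as a mapping from each package to the packages it depends on. The input shape and the function name are hypothetical; the original data pipeline is not shown:

```python
from collections import defaultdict, deque


def deep_dependents(dependencies):
    """Count direct and indirect dependents for every package.

    `dependencies` maps a package name to the packages it depends on
    (an assumed input shape, for illustration only).
    """
    # Invert the graph: for each package, who depends on it directly?
    direct = defaultdict(set)
    for package, deps in dependencies.items():
        for dep in deps:
            direct[dep].add(package)

    counts = {}
    packages = set(dependencies) | set(direct)
    for package in packages:
        # Breadth-first walk over dependents, counting each one once.
        seen = set()
        queue = deque(direct[package])
        while queue:
            dependent = queue.popleft()
            if dependent in seen or dependent == package:
                continue
            seen.add(dependent)
            queue.extend(direct[dependent])
        counts[package] = len(seen)
    return counts


# Made-up graph: app -> web -> util, and cli -> util.
example = {"app": ["web"], "web": ["util"], "cli": ["util"], "util": []}
print(deep_dependents(example))
```

In this toy graph `util` has three deep dependents: `web` and `cli` directly, and `app` through `web`.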
95% of packages on npm have 8 or fewer packages depending on them.
q95 = all_packages['deep_dependents'].quantile(q=.95)
8.0
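The cutoff is the 0.95 quantile of `deep_dependents`. A sketch of how such a quantile can be computed with linear interpolation between neighbouring sorted values, which is the default interpolation in pandas' `Series.quantile`. The sample counts below are made up:

```python
def quantile(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Made-up dependent counts: most packages have none, a few have many.
counts = [0] * 15 + [1, 2, 8, 40, 900]
cutoff = quantile(counts, 0.95)
top = [c for c in counts if c > cutoff]
print(cutoff, top)
```

Keeping only values strictly above the cutoff mirrors the `> q95` filter used to select the most depended-upon packages.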
That leaves us with 4,477 packages (5%) that are the most depended-upon packages on npm.
most_dependent_upon = all_packages[all_packages['deep_dependents'] > q95]
len(most_dependent_upon)
The top 5% of most depended-upon packages exhibit a clear trend: the older a package is, the more likely it is to be in that top 5%. Packages created in the past six months make up only ~12% of the top 5% and packages created six months to a year ago ~18%, while the remaining ~70% are older than a year!
most_dependent_upon = all_packages[all_packages['deep_dependents'] > q95]
most_dependent_upon.groupby(['age_category'])['package'].count() / len(most_dependent_upon)
age_category
age_0.25_year 0.041099
age_0.5_year 0.082868
age_1_year 0.186285
age_2_year 0.342864
age_3_year 0.346661
The majority of packages on npm, 70,821 (75%), have no packages depending on them.
len(all_packages[all_packages['deep_dependents'] == 0]) / float(len(all_packages))
0.7499444062053264
In the top 5% of depended-upon packages, older packages have slightly more dependents, but not by much. Past year: 1,317 packages (median 20 dependents); the year before: 1,866 (median 24); and older than that: 1,931 (median 33). Age does not appear to be a strong predictor of how many dependents a package has.
# Split the top 5% by age in days: under 1 year, 1-2 years, 2 years or more
thisyear = most_dependent_upon[most_dependent_upon['age'] < 365]
lastyear = most_dependent_upon[(most_dependent_upon['age'] >= 365) & (most_dependent_upon['age'] < 2 * 365)]
yearbefore = most_dependent_upon[(most_dependent_upon['age'] >= 365 * 2)]
print(len(thisyear), len(lastyear), len(yearbefore))
print(thisyear['deep_dependents'].median(), lastyear['deep_dependents'].median(), yearbefore['deep_dependents'].median())
npm packages are not the most collaborative of endeavours: most packages on npm (88,132, or 93%) have only one maintainer, although contributors are not accounted for.
print(len(all_packages[all_packages['maintainer_count'] == 1]))
print(len(all_packages[all_packages['maintainer_count'] == 1]) / float(len(all_packages)))
Among packages with at least 100 deep dependents, there is no linear relationship between how many dependents a package has and how many maintainers it has (R-squared 0.000, p = 0.741). Turns out, it doesn't take a village to raise a successful npm package.
# Only consider packages with at least 100 deep dependents
s = all_packages[all_packages['deep_dependents'] > 99]
x = 'maintainer_count'
y = 'deep_dependents'
mod = ols(formula='deep_dependents ~ maintainer_count', data=s)
res = mod.fit()
print(res.summary())
# Scatter plot with fitted line (left) and residuals (right)
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
sns.regplot(x=x, y=y, data=s, ax=ax1)
ax1.set(xlabel=x, ylabel=y)
sns.residplot(x=x, y=y, data=s, color="seagreen", ax=ax2)
                            OLS Regression Results
==============================================================================
Dep. Variable:        deep_dependents   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.1094
Date:                Thu, 25 Sep 2014   Prob (F-statistic):              0.741
Time:                        13:43:43   Log-Likelihood:                -10201.
No. Observations:                1072   AIC:                         2.041e+04
Df Residuals:                    1070   BIC:                         2.042e+04
Df Model:                           1
====================================================================================
                       coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept         1731.1602    124.380     13.918      0.000      1487.104  1975.217
maintainer_count   -12.6690     38.308     -0.331      0.741       -87.836    62.498
==============================================================================
Omnibus:                      861.536   Durbin-Watson:                   1.846
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            14296.422
Skew:                           3.750   Prob(JB):                         0.00
Kurtosis:                      19.242   Cond. No.                         4.16
==============================================================================